Model test report

This notebook provides example EDSL code for running test questions with language models of your choice and generating a model performance report. It is the same test that is run daily to update the model pricing and availability page at Coop, a free platform for creating and sharing AI-based research: https://www.expectedparrot.com/home/report.

The test questions are designed to show whether a given model is live (not deprecated) and capable of answering a simple question, and also whether it is a vision model capable of recognizing a simple image (a picture of the Expected Parrot logo). The questions are readily editable and can be modified to test other content of your choice, e.g., more complicated questions or question types, or other data types. EDSL comes with a variety of methods for automatically adding different types of content to your surveys, including PNG, PDF, CSV, tables, dictionaries, lists, etc., which we also demonstrate below. We recommend using and modifying the code as needed to individually test the models, questions, scenarios and other components of your job before running a larger job.

To learn more about each of the objects and methods used below please see the EDSL documentation page.

If you have questions or need help, please post a message to our Discord channel or send an email to info@expectedparrot.com.

[1]:
from edsl import (
    Cache,
    FileStore,
    Model,
    ModelList,
    QuestionMultipleChoice,
    QuestionFreeText,
    Scenario,
    ScenarioList,
    Survey,
    Results
)

Test questions

Here we create a survey of simple questions to test each model’s ability to answer a question and recognize an image. Modify the questions and question types as needed to test a model’s ability to complete your own survey. (If you add other question types, be sure to import them above.)

[2]:
q1 = QuestionMultipleChoice(
    question_name = "capital_of_france",
    question_text = "What is the capital of France?",
    question_options = ["Paris", "London", "Berlin"],
)

q2 = QuestionFreeText(
    question_name = "image_description",
    question_text = "Describe what you see in this image: {{ image }}",
)

survey = Survey(questions=[q1, q2])
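
For example, here is a sketch of adding a third question of another type. It assumes QuestionLinearScale is available in your EDSL version (import it above if you use it); the question itself is hypothetical:

# from edsl import QuestionLinearScale
#
# q3 = QuestionLinearScale(
#     question_name = "parrot_rating",
#     question_text = "On a scale from 0 to 5, how much do you like parrots?",
#     question_options = [0, 1, 2, 3, 4, 5],
# )
#
# survey = Survey(questions=[q1, q2, q3])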

Scenarios

The second test question above uses a {{ placeholder }} for content to be added to the question when it is run. EDSL comes with a variety of methods for automatically adding “scenarios” of content to your questions from different data and file types, including PNG, PDF, CSV, text, dictionaries, lists, tables, etc.

Here we add an image so that our test report identifies vision models capable of recognizing a simple picture (the Expected Parrot logo). First, we use the FileStore module to post the image to Coop and make it accessible to anyone using this notebook. (Content is unlisted by default; you can make content public or private from your workspace or at the web app. Learn more about using FileStore and sharing content at Coop.)

Then we retrieve it and use it in a Scenario. (We could also use multiple images or other content at once. Learn more about using scenarios to parameterize questions or add metadata to your surveys.) Modify the steps below to use other content with your questions.

Note: You must have a Coop account in order to post and retrieve content. To run this test with local content only, simply create a scenario and add it to the survey.

To post a file to Coop and use it in a scenario:

[3]:
# fs = FileStore(path = "path/to/file.png") # update with local filename
# scenario = Scenario({"image": fs}) # use the parameter key from your question
# scenario_list = ScenarioList(data=[scenario])

To retrieve any available file at Coop to use in a scenario:

[4]:
fs = FileStore.pull(uuid = "d6f7e806-1e36-42f0-8979-ccbd4d180b29") # parrot logo image - update Coop uuid for other images
scenario = Scenario({"image": fs}) # update to match the parameter key used in your question
scenario_list = ScenarioList(data=[scenario])
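
We could also test several images at once by pulling each file and building one scenario per image. A sketch following the same pattern (the second UUID is a hypothetical placeholder; substitute the Coop UUID of your own file):

# uuids = [
#     "d6f7e806-1e36-42f0-8979-ccbd4d180b29", # parrot logo
#     "00000000-0000-0000-0000-000000000000", # hypothetical placeholder
# ]
# scenario_list = ScenarioList(
#     data=[Scenario({"image": FileStore.pull(uuid=u)}) for u in uuids]
# )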

Update job

Recreate the job if you have made any edits to the survey questions or scenarios above. (Drop the scenario list from the chain if you are not using one, or replace it with any new scenarios that you have created.)

[5]:
job = survey.by(scenario_list)
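
If you are not using scenarios (e.g., your questions have no {{ placeholder }} parameters), the survey can be run on its own:

# job = survey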

Run the test

The code below shows how to see a list of available services and specify which ones you want to test. You can also check whether you currently have a local key stored for any service.

Note: You must have local keys stored for the services that you want to test. Otherwise, you can modify the test code inputs to run the test remotely.

To see a list of all services:

[6]:
Model.services()
[6]:
   Service Name  Local key?
0  openai        yes
1  anthropic
2  deep_infra
3  google        yes
4  groq
5  bedrock
6  azure
7  ollama
8  test          yes
9  together
10 perplexity
11 mistral

Change this parameter to False if you want to test the models remotely (i.e., not using your own keys for language models):

[7]:
disable_remote_inference = True

Update this code to identify the services that you want to test with your keys. All of the models for each selected service will be tested:

[8]:
services_to_test = ['google']
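# services_to_test = ['openai', 'google'] # list multiple services to test them all at once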

Specify a filename for the test results that will be generated:

[9]:
filename = "test_model_report.csv"

Code for the test:

[10]:
import csv
import time
from datetime import timedelta
from typing import Optional, List, Dict
[11]:
class ModelTest:
    def __init__(self):
        pass

    def get_model_to_services_mapping(
        self, available_models: list[list[str]]
    ) -> dict:
        """
        Returns a mapping of model names to their available inference services.
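
        e.g. (illustrative input and output):
            [["model_a", "openai"], ["model_b", "google"]]
            -> {"model_a": ["openai"], "model_b": ["google"]}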
        """
        model_to_services = {}
        for item in available_models:
            model_name = item[0]
            inference_service = item[1]
            if model_name in model_to_services:
                model_to_services[model_name].append(inference_service)
            else:
                model_to_services[model_name] = [inference_service]

        return model_to_services

    def get_unique_services(self, available_models: list[list[str]]) -> list[str]:
        """
        Retrieves the list of unique services.
        """
        unique_services = set()
        for item in available_models:
            service_name = item[1]
            unique_services.add(service_name)
        return list(unique_services)

    def get_model_list(
        self, available_models: list[list[str]], services_to_run: list[str]
    ) -> ModelList:
        """
        Returns the EDSL ModelList object with the models that we want to test.
        """
        models = []
        for model_data in available_models:
            model_name = model_data[0]
            service_name = model_data[1]

            if service_name in services_to_run:
                m = Model(service_name=service_name, model_name=model_name)
                models.append(m)

        model_list = ModelList(data=models)
        return model_list

    def run_job(
        self, available_models: list[list[str]], services_to_run: list[str]
    ) -> Results:
        """
        Runs the test job.
        """
        model_list = self.get_model_list(available_models, services_to_run)

        results = (
            survey.by(scenario_list)
            .by(model_list)
            .run(
                cache=Cache(),
                disable_remote_cache=True,
                disable_remote_inference=disable_remote_inference,
                print_exceptions=False,
            )
        )

        return results

    def get_inference_service(
        self, model_name: str, model_to_services: dict
    ) -> str | None:
        """
        Maps a model to a single inference service, consuming it from that model's list
        so that repeated model names are assigned distinct services on subsequent calls.
        Returns the selected inference service for the model, or None if no service is available.
        """
        # If model doesn't exist or has no services, return None
        if model_name not in model_to_services or not model_to_services[model_name]:
            return None

        # Take (and consume) the first available service for this model
        selected_service = model_to_services[model_name].pop(0)

        return selected_service

    def parse_exceptions(self, exceptions_dict: dict, field_name: str) -> str | None:
        """
        Parses exceptions for a specific field from the exceptions dictionary.
        Returns a joined string of unique exceptions. If there are no exceptions, returns None.
        """
        if field_name not in exceptions_dict:
            return None

        unique_exceptions = []
        for exception in exceptions_dict[field_name]:
            exception_data = exception["exception"]
            formatted_exception = (
                f"{exception_data['type']}: {exception_data['message']}"
            )
            if formatted_exception not in unique_exceptions:
                unique_exceptions.append(formatted_exception)

        if unique_exceptions:
            return "\n".join(unique_exceptions)
        else:
            return None

    def parse_results_dict(self, results: dict, model_to_services: dict) -> list[dict]:
        """
        Parses the results dictionary and returns a list of dictionaries with the results.
        """
        records = []

        for key in results.keys():
            if key == "data":
                data = results[key]
                for index, item in enumerate(data):
                    task_history = results.get("task_history")
                    if task_history is not None:
                        exceptions_dict = task_history["interviews"][index][
                            "exceptions"
                        ]

                        capital_of_france_exceptions_string = self.parse_exceptions(
                            exceptions_dict, "capital_of_france"
                        )
                        image_description_exceptions_string = self.parse_exceptions(
                            exceptions_dict, "image_description"
                        )
                    else:
                        capital_of_france_exceptions_string = None
                        image_description_exceptions_string = None

                    records.append(
                        {
                            "inference_service": self.get_inference_service(
                                model_name=item["model"]["model"],
                                model_to_services=model_to_services,
                            ),
                            "model": item["model"]["model"],
                            "answer_capital_of_france": item["answer"][
                                "capital_of_france"
                            ],
                            "answer_image_description": item["answer"][
                                "image_description"
                            ],
                            "exceptions_capital_of_france": capital_of_france_exceptions_string,
                            "exceptions_image_description": image_description_exceptions_string,
                            "works_with_text": item["answer"]["capital_of_france"]
                            == "Paris",
                            "works_with_images": type(
                                item["answer"]["image_description"]
                            )
                            == str
                            and "parrot" in item["answer"]["image_description"],
                        }
                    )
        return records


    def save_to_file(self, results: 'Results', records: List[Dict], filename="test_model_report.csv"):
        """
        Saves the results of this test to a CSV file.
        """
        if records:
            # Determine the fieldnames from the keys of the first record
            fieldnames = records[0].keys()

            with open(filename, "w", newline='') as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)

                # Write the header row
                writer.writeheader()

                # Write each dictionary as a row in the CSV
                for record in records:
                    writer.writerow(record)



    def run_test(
        self,
        services: Optional[list[str]] = None,
        filename: str = "test_model_report.csv",
    ):
        """
        Runs the test, parses the results, and saves to a file.
        """
        try:
            print("Running model test...")

            start_time = time.time()

            available_models = Model.available()

            unique_services = self.get_unique_services(available_models)

            if services is None:
                services_to_run = unique_services
            else:
                services_to_run = services

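            # azure is excluded from the test by default; delete this block to include it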
            try:
                services_to_run.remove("azure")
            except ValueError:
                pass

            results = self.run_job(available_models, services_to_run)

            end_time = time.time()

            runtime_in_seconds = end_time - start_time

            print(
                f"Finished running model test. Runtime: {runtime_in_seconds:.3f} seconds"
            )

            print("Parsing results...")

            model_to_services = self.get_model_to_services_mapping(available_models)
            records = self.parse_results_dict(results.to_dict(), model_to_services)

            print("Finished parsing results.")
            print(f"Saving to {'file'}...")

            self.save_to_file(results, records)

            print(f"Finished saving to {'file'}.")

        except Exception as e:
            print("Exception running model test:", str(e))

Running the test:

[12]:
test_instance = ModelTest()
test_instance.run_test(services=services_to_test, filename=filename)
Running model test...
/Users/a16174/edsl/edsl/inference_services/AvailableModelFetcher.py:139: UserWarning: No models found for service ollama
  warnings.warn(f"No models found for service {service_name}")
Finished running model test. Runtime: 50.373 seconds
Parsing results...
Finished parsing results.
Saving to test_model_report.csv...
Finished saving to test_model_report.csv.
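
Inspecting the report

After the test completes, you can inspect the CSV report directly. Below is a minimal sketch using Python's csv module; it assumes the report was saved with the filename specified above, and that the boolean columns were written as the strings "True"/"False" (as csv.DictWriter does with Python booleans):

import csv

with open(filename) as f:
    rows = list(csv.DictReader(f))

text_ok = sum(r["works_with_text"] == "True" for r in rows)
image_ok = sum(r["works_with_images"] == "True" for r in rows)
print(f"{text_ok}/{len(rows)} models answered the text question correctly")
print(f"{image_ok}/{len(rows)} models recognized the image")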

Posting this notebook to Coop

Here we also demonstrate how to post any object to Coop, such as this notebook:

[15]:
from edsl import Notebook

n = Notebook("model_test_report.ipynb")
n.push(description = "Run a model test report", visibility = "public")
[15]:
{'description': 'Run a model test report',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/a7131421-c409-46f8-9059-21197501969c',
 'uuid': 'a7131421-c409-46f8-9059-21197501969c',
 'version': '0.1.40.dev1',
 'visibility': 'public'}

To update an object at Coop:

[13]:
from edsl import Notebook

n = Notebook("model_test_report.ipynb") # resave the object
n.patch(uuid = "a7131421-c409-46f8-9059-21197501969c", value = n) # specify the Coop uuid
[13]:
{'status': 'success'}