Paper Reading LIDA

April 21, 2024 · 7 min read

TLDR

LIDA is a novel tool for generating grammar-agnostic visualizations and infographics. It addresses several key challenges in automatic visualization creation:

Understanding data semantics
Enumerating visualization goals
Generating visualization specifications

The system uses a multi-stage pipeline powered by large language models (LLMs) and image generation models (IGMs). It consists of 4 main modules:

SUMMARIZER: Converts data into compact natural language summaries
GOAL EXPLORER: Identifies potential visualization goals based on the data
VISGENERATOR: Handles visualization code generation, refinement and filtering
INFOGRAPHER: Creates data-driven stylized graphics using IGMs

LIDA provides both a Python API and an interactive user interface supporting direct manipulation and multilingual natural language for generating charts, infographics and data stories.

Paper

Notes

【Original code on GitHub】

Process

Following the instructions in the GitHub repository, we can clone the source code and run LIDA locally, which lets us explore its capabilities firsthand. A typical session goes through the following steps:

Select a visualization library/grammar and configure the large language model. Since the default models are deprecated, you'll need to choose a new model from the available options.
Select from the example datasets or upload your own custom data for visualization.
Review the automatically generated data summary to understand the key characteristics and patterns in your dataset.
Choose a visualization goal from the suggestions generated based on your data summary.
Review the visualization automatically generated to match your selected goal.
Refine and customize the visualization through interactive editing tools.

Code

How does LIDA work? It might not be as complex as you imagine.

Manager Class

# Visualization manager class that handles the visualization of the data with the following methods

# summarize data given a df
# generate goals given a summary
# generate visualization specifications given a summary and a goal
# execute the specification given some data


class Manager:
    def __init__(self, text_gen: TextGenerator = None) -> None:
        """
        Initialize the Manager object.

        Args:
            text_gen (TextGenerator, optional): Text generator object. Defaults to None.
        """

        self.text_gen = text_gen or llm()

        self.summarizer = Summarizer()
        self.goal = GoalExplorer()
        self.vizgen = VizGenerator()
        self.vizeditor = VizEditor()
        self.executor = ChartExecutor()
        self.explainer = VizExplainer()
        self.evaluator = VizEvaluator()
        self.repairer = VizRepairer()
        self.recommender = VizRecommender()
        self.data = None
        self.infographer = None
        self.persona = PersonaExplorer()
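
The constructor above suggests a facade: one object owns a shared text generator plus one instance of each stage, and forwards calls to them. The following is a minimal sketch of that idea, not LIDA's actual code. `MiniManager`, `echo_llm`, and the simplified `Summarizer` and `GoalExplorer` bodies are hypothetical stand-ins, and the assumption that the text generator is passed to each stage on every call is ours.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for a text generator: any callable mapping a prompt to text.
TextGen = Callable[[str], str]


def echo_llm(prompt: str) -> str:
    """Placeholder 'model' used when no generator is supplied (not a real LLM)."""
    return f"[stub response to: {prompt[:40]}]"


class Summarizer:
    def summarize(self, rows: List[dict], text_gen: TextGen) -> str:
        columns = ", ".join(sorted(rows[0].keys())) if rows else ""
        return text_gen(f"Describe a dataset with columns: {columns}")


class GoalExplorer:
    def generate(self, summary: str, n: int, text_gen: TextGen) -> List[str]:
        return [text_gen(f"Goal {i} for: {summary}") for i in range(n)]


class MiniManager:
    """Facade: owns one shared text generator and one instance of each stage."""

    def __init__(self, text_gen: Optional[TextGen] = None) -> None:
        # Same fallback idiom as `text_gen or llm()` in the excerpt above.
        self.text_gen = text_gen or echo_llm
        self.summarizer = Summarizer()
        self.goal = GoalExplorer()

    def summarize(self, rows: List[dict]) -> str:
        return self.summarizer.summarize(rows, self.text_gen)

    def goals(self, summary: str, n: int = 2) -> List[str]:
        return self.goal.generate(summary, n, self.text_gen)


manager = MiniManager()
summary = manager.summarize([{"Name": "Prius", "Type": "Sedan"}])
goals = manager.goals(summary, n=2)
print(summary)
print(len(goals))
```

Because the generator is injected, a different model (or a canned stub in tests) can be swapped in without touching any stage.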

Prompt Design

Summarize

You are an experienced data analyst that can annotate datasets. Your instructions are as follows:
i) ALWAYS generate the name of the dataset and the dataset_description
ii) ALWAYS generate a field description.
iii.) ALWAYS generate a semantic_type (a single word) for each field given its values e.g. company, city, number, supplier, location, gender, longitude, latitude, url, ip address, zip code, email, etc
You must return an updated JSON dictionary without any preamble or explanation.

        """
        Summarize data given a DataFrame or file path.

        Args:
            data (Union[pd.DataFrame, str]): Input data, either a DataFrame or file path.
            file_name (str, optional): Name of the file if data is loaded from a file path. Defaults to "".
            n_samples (int, optional): Number of summary samples to generate. Defaults to 3.
            summary_method (str, optional): Summary method to use. Defaults to "default".
            textgen_config (TextGenerationConfig, optional): Text generation configuration. Defaults to TextGenerationConfig(n=1, temperature=0).

        Returns:
            Summary: Summary object containing the generated summary.

        Example of Summary:
            {
                'name': 'cars.csv',
                'file_name': 'cars.csv',
                'dataset_description': '',
                'fields': [
                    {
                        'column': 'Name',
                        'properties': {
                            'dtype': 'string',
                            'samples': [
                                'Nissan Altima S 4dr',
                                'Mercury Marauder 4dr',
                                'Toyota Prius 4dr (gas/electric)'
                            ],
                            'num_unique_values': 385,
                            'semantic_type': '',
                            'description': ''
                        }
                    },
                    {
                        'column': 'Type',
                        'properties': {
                            'dtype': 'category',
                            'samples': ['SUV', 'Minivan', 'Sports Car'],
                            'num_unique_values': 5,
                            'semantic_type': '',
                            'description': ''
                        }
                    },
                    {
                        'column': 'AWD',
                        'properties': {
                            'dtype': 'number',
                            'std': 0,
                            'min': 0,
                            'max': 1,
                            'samples': [1, 0],
                            'num_unique_values': 2,
                            'semantic_type': '',
                            'description': ''
                        }
                    }
                ]
            }

        """
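
The example summary leaves `semantic_type` and `description` empty, which fits the prompt's request to return an "updated" dictionary. This suggests that the statistical fields are computed first and the LLM only fills in the annotations. Here is a rough sketch of that first step using only the standard library. The function names are hypothetical, the type detection is deliberately naive, and samples are simply the first distinct values rather than whatever sampling LIDA actually uses.

```python
import statistics
from typing import Any, Dict, List


def column_properties(values: List[Any], n_samples: int = 3) -> Dict[str, Any]:
    """Build the 'properties' dict for one column from its raw values."""
    distinct = list(dict.fromkeys(values))  # unique values, first-seen order
    numeric = bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    props: Dict[str, Any] = {"dtype": "number" if numeric else "string"}
    if numeric:
        props["std"] = statistics.pstdev(values)
        props["min"] = min(values)
        props["max"] = max(values)
    props["samples"] = distinct[:n_samples]
    props["num_unique_values"] = len(distinct)
    # Left empty on purpose: these are the fields the LLM is asked to fill in.
    props["semantic_type"] = ""
    props["description"] = ""
    return props


def base_summary(rows: List[Dict[str, Any]], file_name: str, n_samples: int = 3) -> Dict[str, Any]:
    """Assemble a summary dict shaped like the example above."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "name": file_name,
        "file_name": file_name,
        "dataset_description": "",
        "fields": [
            {"column": c, "properties": column_properties([r[c] for r in rows], n_samples)}
            for c in columns
        ],
    }


rows = [
    {"Type": "SUV", "AWD": 1},
    {"Type": "Minivan", "AWD": 0},
    {"Type": "SUV", "AWD": 1},
]
summary = base_summary(rows, "cars.csv")
print(summary["fields"][1]["properties"])
```

Keeping the summary this compact is what lets a whole dataset fit in a prompt: the model sees a few samples and statistics per column instead of every row.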

Goal

You are a an experienced data analyst who can generate a given number of insightful GOALS about data, when given a summary of the data, and a specified persona. The VISUALIZATIONS YOU RECOMMEND MUST FOLLOW VISUALIZATION BEST PRACTICES (e.g., must use bar charts instead of pie charts for comparing quantities) AND BE MEANINGFUL (e.g., plot longitude and latitude on maps where appropriate). They must also be relevant to the specified persona. Each goal must include a question, a visualization (THE VISUALIZATION MUST REFERENCE THE EXACT COLUMN FIELDS FROM THE SUMMARY), and a rationale (JUSTIFICATION FOR WHICH dataset FIELDS ARE USED and what we will learn from the visualization). Each goal MUST mention the exact fields from the dataset summary above

The number of GOALS to generate is {n}. The goals should be based on the data summary below,
The generated goals SHOULD BE FOCUSED ON THE INTERESTS AND PERSPECTIVE of a '{persona.persona} persona, who is insterested in complex, insightful goals about the data.

Generate goals based on a summary.

Args:
    summary (Summary): Input summary.
    textgen_config (TextGenerationConfig, optional): Text generation configuration. Defaults to TextGenerationConfig().
    n (int, optional): Number of goals to generate. Defaults to 5.
    persona (Persona, str, dict, optional): Persona information. Defaults to None.

Returns:
    List[Goal]: List of generated goals.

Example of list of goals:

    Goal 0
    Question: What is the distribution of Retail_Price?

    Visualization: histogram of Retail_Price

    Rationale: This tells about the spread of prices of cars in the dataset.

    Goal 1
    Question: What is the distribution of Horsepower_HP_?

    Visualization: box plot of Horsepower_HP_

    Rationale: This tells about the distribution of horsepower of cars in the dataset.

THE OUTPUT MUST BE A CODE SNIPPET OF A VALID LIST OF JSON OBJECTS. IT MUST USE THE FOLLOWING FORMAT:

```
   [
      { "index": 0,  "question": "What is the distribution of X", "visualization": "histogram of X", "rationale": "This tells about "} ..
   ]
```

THE OUTPUT SHOULD ONLY USE THE JSON FORMAT ABOVE.

VizGenerator

You are a helpful assistant highly skilled in writing PERFECT code for visualizations. Given some code template, you complete the template to generate a visualization given the dataset and the goal described. The code you write MUST FOLLOW VISUALIZATION BEST PRACTICES ie. meet the specified goal, apply the right transformation, use the right visualization type, use the right data encoding, and use the right aesthetics (e.g., ensure axis are legible). The transformations you apply MUST be correct and the fields you use MUST be correct. The visualization CODE MUST BE CORRECT and MUST NOT CONTAIN ANY SYNTAX OR LOGIC ERRORS (e.g., it must consider the field types and use them correctly). You MUST first generate a brief plan for how you would solve the task e.g. what transformations you would apply e.g. if you need to construct a new column, what fields you would use, what visualization type you would use, what aesthetics you would use, etc. .

Always add a legend with various colors where appropriate. The visualization code MUST only use data fields that exist in the dataset (field_names) or fields that are transformations based on existing field_names). Only use variables that have been defined in the code or are in the dataset summary. You MUST return a FULL PYTHON PROGRAM ENCLOSED IN BACKTICKS ``` that starts with an import statement. DO NOT add any explanation. \n\n THE GENERATED CODE SOLUTION SHOULD BE CREATED BY MODIFYING THE SPECIFIED PARTS OF THE TEMPLATE BELOW \n\n {library_template} \n\n.The FINAL COMPLETED CODE BASED ON THE TEMPLATE above is ... \n\n

This is an excellent prompt design that we can learn from. Its strengths include:

Well-structured task description with clear goals and requirements
Clear and concise language that avoids ambiguity
Explicit role definition for the AI assistant
Few-shot prompting
Precise specification of output format and structure
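
To make the VizGenerator contract concrete (complete a template, return a full program inside backticks), here is a hedged sketch of the round trip: build the prompt, pull the code out of the response, and execute it. `LIBRARY_TEMPLATE` is a made-up template, the model response is canned, and the "chart" is a plain `Counter` so the example runs without a plotting library. How LIDA actually executes generated code may differ.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block

# Hypothetical template; the `<stub>` line is the part the model is asked to rewrite.
LIBRARY_TEMPLATE = '''def plot(data):
    <stub>  # only modify this section
    return chart

chart = plot(data)'''


def build_prompt(template: str, goal: str) -> str:
    """Combine the goal with the template, echoing the instruction from the prompt above."""
    return (
        f"Goal: {goal}\n\n"
        "THE GENERATED CODE SOLUTION SHOULD BE CREATED BY MODIFYING THE SPECIFIED "
        f"PARTS OF THE TEMPLATE BELOW\n\n{template}\n\n"
    )


def extract_code(response: str) -> str:
    """Return the first fenced block, or the whole response if no fence is found."""
    pattern = re.escape(FENCE) + r"(?:python)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    return (match.group(1) if match else response).strip()


def run_generated(code: str, data):
    """Execute generated code; only do this with untrusted output inside a sandbox."""
    namespace = {"data": data}
    exec(code, namespace)
    return namespace["chart"]  # the template guarantees a `chart` variable


# Canned response standing in for a real model call.
fake_response = (
    "Plan: count rows per Type.\n"
    + FENCE + "python\n"
    "from collections import Counter\n"
    "def plot(data):\n"
    "    chart = Counter(row['Type'] for row in data)\n"
    "    return chart\n"
    "\n"
    "chart = plot(data)\n"
    + FENCE
)

rows = [{"Type": "SUV"}, {"Type": "Minivan"}, {"Type": "SUV"}]
prompt = build_prompt(LIBRARY_TEMPLATE, "bar chart of Type")
chart = run_generated(extract_code(fake_response), rows)
print(chart)
```

The template is what makes the output predictable: because every completed program ends by assigning `chart`, the caller knows exactly which variable to read back after execution.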
