Paper Reading LIDA
TLDR
LIDA is a novel tool for generating grammar-agnostic visualizations and infographics. It addresses several key challenges in automatic visualization creation:
- Understanding data semantics
- Enumerating visualization goals
- Generating visualization specifications
The system uses a multi-stage pipeline powered by large language models (LLMs) and image generation models (IGMs). It consists of 4 main modules:
- SUMMARIZER: Converts data into compact natural language summaries
- GOAL EXPLORER: Identifies potential visualization goals based on the data
- VISGENERATOR: Handles visualization code generation, refinement and filtering
- INFOGRAPHER: Creates data-driven stylized graphics using IGMs
LIDA provides both a Python API and an interactive user interface supporting direct manipulation and multilingual natural language for generating charts, infographics and data stories.
Paper
Notes
Process
By following the instructions in the GitHub repository, we can clone the source code and set up LIDA to run locally in our development environment. This allows us to experiment with and explore the capabilities of LIDA firsthand.
-
Select a visualization library/grammar and configure the large language model. Since the default models are deprecated, you'll need to choose a new model from the available options.
-
Select from the example datasets or upload your own custom data for visualization.
-
Review the automatically generated data summary to understand the key characteristics and patterns in your dataset.
-
Choose a visualization goal from the suggestions generated based on your data summary. T
-
Review the visualization automatically generated to match your selected goal.
-
Refine and customize the visualization through interactive editing tools.
Code
How does LIDA work? It might not be as complex as you imagine.
Manager Class
# Visualization manager class that handles the visualization of the data with the following methods
# summarize data given a df
# generate goals given a summary
# generate generate visualization specifications given a summary and a goal
# execute the specification given some data
def __init__(self, text_gen: TextGenerator = None) -> None:
"""
Initialize the Manager object.
Args:
text_gen (TextGenerator, optional): Text generator object. Defaults to None.
"""
self.text_gen = text_gen or llm()
self.summarizer = Summarizer()
self.goal = GoalExplorer()
self.vizgen = VizGenerator()
self.vizeditor = VizEditor()
self.executor = ChartExecutor()
self.explainer = VizExplainer()
self.evaluator = VizEvaluator()
self.repairer = VizRepairer()
self.recommender = VizRecommender()
self.data = None
self.infographer = None
self.persona = PersonaExplorer()
Prompt Design
Summarize
You are an experienced data analyst that can annotate datasets. Your instructions are as follows:
i) ALWAYS generate the name of the dataset and the dataset_description
ii) ALWAYS generate a field description.
iii.) ALWAYS generate a semantic_type (a single word) for each field given its values e.g. company, city, number, supplier, location, gender, longitude, latitude, url, ip address, zip code, email, etc
You must return an updated JSON dictionary without any preamble or explanation.
Summarize data given a DataFrame or file path.
Args:
data (Union[pd.DataFrame, str]): Input data, either a DataFrame or file path.
file_name (str, optional): Name of the file if data is loaded from a file path. Defaults to "".
n_samples (int, optional): Number of summary samples to generate. Defaults to 3.
summary_method (str, optional): Summary method to use. Defaults to "default".
textgen_config (TextGenerationConfig, optional): Text generation configuration. Defaults to TextGenerationConfig(n=1, temperature=0).
Returns:
Summary: Summary object containing the generated summary.
Example of Summary:
{
'name': 'cars.csv',
'file_name': 'cars.csv',
'dataset_description': '',
'fields': [
{
'column': 'Name',
'properties': {
'dtype': 'string',
'samples': [
'Nissan Altima S 4dr',
'Mercury Marauder 4dr',
'Toyota Prius 4dr (gas/electric)'
],
'num_unique_values': 385,
'semantic_type': '',
'description': ''
}
},
{
'column': 'Type',
'properties': {
'dtype': 'category',
'samples': ['SUV', 'Minivan', 'Sports Car'],
'num_unique_values': 5,
'semantic_type': '',
'description': ''
}
},
{
'column': 'AWD',
'properties': {
'dtype': 'number',
'std': 0,
'min': 0,
'max': 1,
'samples': [1, 0],
'num_unique_values': 2,
'semantic_type': '',
'description': ''
}
}
]
}
"""
Goal
You are a an experienced data analyst who can generate a given number of insightful GOALS about data, when given a summary of the data, and a specified persona. The VISUALIZATIONS YOU RECOMMEND MUST FOLLOW VISUALIZATION BEST PRACTICES (e.g., must use bar charts instead of pie charts for comparing quantities) AND BE MEANINGFUL (e.g., plot longitude and latitude on maps where appropriate). They must also be relevant to the specified persona. Each goal must include a question, a visualization (THE VISUALIZATION MUST REFERENCE THE EXACT COLUMN FIELDS FROM THE SUMMARY), and a rationale (JUSTIFICATION FOR WHICH dataset FIELDS ARE USED and what we will learn from the visualization). Each goal MUST mention the exact fields from the dataset summary above
The number of GOALS to generate is {n}. The goals should be based on the data summary below,
The generated goals SHOULD BE FOCUSED ON THE INTERESTS AND PERSPECTIVE of a '{persona.persona} persona, who is insterested in complex, insightful goals about the data.
Generate goals based on a summary.
Args:
summary (Summary): Input summary.
textgen_config (TextGenerationConfig, optional): Text generation configuration. Defaults to TextGenerationConfig().
n (int, optional): Number of goals to generate. Defaults to 5.
persona (Persona, str, dict, optional): Persona information. Defaults to None.
Returns:
List[Goal]: List of generated goals.
Example of list of goals:
Goal 0
Question: What is the distribution of Retail_Price?
Visualization: histogram of Retail_Price
Rationale: This tells about the spread of prices of cars in the dataset.
Goal 1
Question: What is the distribution of Horsepower_HP_?
Visualization: box plot of Horsepower_HP_
Rationale: This tells about the distribution of horsepower of cars in the dataset.
THE OUTPUT MUST BE A CODE SNIPPET OF A VALID LIST OF JSON OBJECTS. IT MUST USE THE FOLLOWING FORMAT:
```
[
{ "index": 0, "question": "What is the distribution of X", "visualization": "histogram of X", "rationale": "This tells about "} ..
]
```
THE OUTPUT SHOULD ONLY USE THE JSON FORMAT ABOVE.
VizGenerator
You are a helpful assistant highly skilled in writing PERFECT code for visualizations. Given some code template, you complete the template to generate a visualization given the dataset and the goal described. The code you write MUST FOLLOW VISUALIZATION BEST PRACTICES ie. meet the specified goal, apply the right transformation, use the right visualization type, use the right data encoding, and use the right aesthetics (e.g., ensure axis are legible). The transformations you apply MUST be correct and the fields you use MUST be correct. The visualization CODE MUST BE CORRECT and MUST NOT CONTAIN ANY SYNTAX OR LOGIC ERRORS (e.g., it must consider the field types and use them correctly). You MUST first generate a brief plan for how you would solve the task e.g. what transformations you would apply e.g. if you need to construct a new column, what fields you would use, what visualization type you would use, what aesthetics you would use, etc. .
Always add a legend with various colors where appropriate. The visualization code MUST only use data fields that exist in the dataset (field_names) or fields that are transformations based on existing field_names). Only use variables that have been defined in the code or are in the dataset summary. You MUST return a FULL PYTHON PROGRAM ENCLOSED IN BACKTICKS ``` that starts with an import statement. DO NOT add any explanation. \n\n THE GENERATED CODE SOLUTION SHOULD BE CREATED BY MODIFYING THE SPECIFIED PARTS OF THE TEMPLATE BELOW \n\n {library_template} \n\n.The FINAL COMPLETED CODE BASED ON THE TEMPLATE above is ... \n\n
This is an excellent prompt design that we can learn from.
- Well-structured task description with clear goals and requirements
- Clear and concise language that avoids ambiguity
- Explicit role definition for the AI assistant
- Few-shot prompting
- Precise specification of output format and structure