Automating Data Analysis: The Magic of LIDA's Intelligent Visualization!
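01 Overview
In this data-driven era, we generate and process massive amounts of data every day. Extracting valuable information from that data and presenting it in an intuitive, easy-to-understand form has become an important problem. Today I'd like to introduce a powerful tool, Language-Integrated Data Analysis (LIDA), which automates the creation of visualizations and puts data insights within easy reach.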
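02 Core Features of LIDA
Grammar-agnostic visualization
Whether you develop in Python, R, or C++, LIDA can help you produce visual output without locking you into a specific programming language or visualization library (matplotlib, seaborn, altair, and plotly all appear as options in the hands-on section of this article). This flexibility makes it easy for users from different programming backgrounds to get started.
Multi-stage generation pipeline
LIDA offers a seamless workflow, from data summarization through to visualization creation, that helps users work with complex datasets.
Hybrid user interface
LIDA offers both direct manipulation and a multilingual natural language interface, which makes it accessible to a broad audience, from data scientists to business analysts. Users can interact through natural language commands, so data visualization becomes intuitive and simple.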
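03 LIDA's Architecture
LIDA's architecture consists of the following key components:
- Summarizer: converts a dataset into a compact natural language description, including all column names, distributions, and similar information.
- Goal Explorer: identifies potential visualization or analysis goals based on the dataset and generates a user-specified number of goals.
- Viz Generator: automatically generates, evaluates, refines, and executes visualization code based on the dataset context and a specified goal.
- Infographer: uses image generation models to turn a visualization into a fully stylized, data-faithful infographic.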
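04 Main Capabilities of LIDA
- Data summarization: LIDA compresses large datasets into dense natural language summaries that serve as the grounding context for all later operations.
- Automated data exploration: LIDA provides a fully automated mode for generating meaningful visualization goals from an unfamiliar dataset.
- Infographic generation: uses image generation models to turn data into stylized, engaging infographics for personalized storytelling.
- VizOps (visualization operations): detailed operations on generated visualizations that improve accessibility, data literacy, and debugging.
- Visualization explanation: in-depth descriptions of the visualization code, supporting accessibility, education, and understanding.
- Self-evaluation: uses large language models (LLMs) to produce multi-dimensional evaluation scores for a visualization based on best practices.
- Visualization repair: automatically improves or repairs a visualization using self-evaluation results or user-provided feedback.
- Visualization recommendation: recommends additional visualizations based on the context or on existing visualizations, for comparison or extra perspectives.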
05 LIDA in Practice
Installation
Install with pip:
pip install lida
# Set the corresponding API key
export OPENAI_API_KEY=<API_KEY>
You can also manage the API key with a .env file:
from dotenv import load_dotenv
import os
# Read the .env file
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
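To make the .env approach less of a black box, here is a minimal standard-library sketch of the idea behind load_dotenv(): read KEY=VALUE lines from a file and expose them through os.environ. The helper name load_env_file, the variable DEMO_LIDA_API_KEY, and the placeholder value are all made up for illustration; the real python-dotenv package handles many more cases (quoting rules, variable expansion, file discovery).

```python
import os
import tempfile


def load_env_file(path):
    """Hypothetical helper: parse KEY=VALUE lines and export them via os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            # Like python-dotenv's default, keep variables that are already set.
            os.environ.setdefault(key, value)
            loaded[key] = os.environ[key]
    return loaded


# Demo with a throwaway file and a placeholder key (not a real credential).
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# demo file\nDEMO_LIDA_API_KEY=sk-placeholder\n")
    demo_values = load_env_file(env_path)

print(sorted(demo_values))
```

Keeping the key in a .env file (and out of version control) means the notebook or script itself never contains the secret.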
LIDA Features in Detail
- Initialization
from lida import Manager, TextGenerationConfig, llm
from lida.utils import plot_raster
import warnings
from dotenv import load_dotenv
import os
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
warnings.filterwarnings("ignore")
# Initialize LIDA
lida = Manager(text_gen=llm("openai", api_key=str(OPENAI_API_KEY)))  # !! supply your OpenAI (or other LLM) API key
textgen_config = TextGenerationConfig(n=1, temperature=0.5, model="gpt-3.5-turbo-0301", use_cache=True)
lida.Manager is the controller of the LIDA library and is where the LLM provider is set. lida.TextGenerationConfig holds the detailed generation settings: the number of generations n, the temperature (which controls how much the output varies), the model, and use_cache.
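As a rough illustration of what a setting like use_cache buys you, the sketch below caches responses keyed on the prompt plus the generation settings, so a repeated request does not trigger a second model call. This is only one plausible reading of the option, not LIDA's actual implementation: DemoTextGenConfig, fake_llm, and generate are hypothetical names, and the "model" is a stub that returns canned strings.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class DemoTextGenConfig:
    """Hypothetical stand-in mirroring the four settings discussed above."""
    n: int = 1
    temperature: float = 0.5
    model: str = "gpt-3.5-turbo-0301"
    use_cache: bool = True


_cache = {}
stats = {"model_calls": 0}


def fake_llm(prompt, config):
    """Placeholder for a real model call; counts how often it is invoked."""
    stats["model_calls"] += 1
    return [f"response {i} to: {prompt}" for i in range(config.n)]


def generate(prompt, config):
    """Return cached responses when use_cache is on and the request was seen before."""
    key_source = json.dumps({"prompt": prompt, "config": asdict(config)}, sort_keys=True)
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
    if config.use_cache and key in _cache:
        return _cache[key]
    responses = fake_llm(prompt, config)
    if config.use_cache:
        _cache[key] = responses
    return responses


config = DemoTextGenConfig(n=2, temperature=0.5)
first = generate("summarize the cars dataset", config)
second = generate("summarize the cars dataset", config)
print(len(first), stats["model_calls"])
```

Because the settings are part of the cache key in this sketch, changing n or temperature produces a fresh call rather than a stale cached answer.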
- Loading the data
import pandas as pd
# This example uses the cars dataset recommended in the official documentation
data = pd.read_csv("https://raw.githubusercontent.com/uwdata/draco/master/data/cars.csv")
data.head()
- Data summary
Generate a brief summary of the dataset. For each column, the summary includes std, min, max, samples, the number of unique values, semantic_type, and description.
# Data summary: generate a short summary from the dataset
summary = lida.summarize("https://raw.githubusercontent.com/uwdata/draco/master/data/cars.csv", summary_method="default", textgen_config=textgen_config)
print(summary)
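To give a feel for the statistical part of such a summary, here is a small standard-library sketch that computes std, min, max, samples, and a unique-value count per column. It is not LIDA's summarizer: the function summarize_columns is hypothetical, the four rows are invented for the demo (they are not taken from the cars dataset), and fields such as semantic_type and description, which come from the LLM, are left out.

```python
import csv
import io
import statistics

# Invented rows for illustration only.
SAMPLE_CSV = """Name,Type,Horsepower
Car A,Sedan,200
Car B,SUV,250
Car C,Sedan,150
Car D,Wagon,200
"""


def summarize_columns(csv_text, n_samples=2):
    """Hypothetical helper: per-column stats similar in spirit to a data summary."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        info = {"unique": len(set(values)), "samples": values[:n_samples]}
        try:
            numbers = [float(value) for value in values]
        except ValueError:
            # At least one value is not numeric, so treat the column as text.
            info["dtype"] = "string"
        else:
            info["dtype"] = "number"
            info["min"] = min(numbers)
            info["max"] = max(numbers)
            info["std"] = statistics.stdev(numbers)  # sample standard deviation
        summary[column] = info
    return summary


column_summary = summarize_columns(SAMPLE_CSV)
for name, info in column_summary.items():
    print(name, info)
```

A compact structure like this, rather than the raw rows, is the kind of context that can be handed to an LLM without exceeding its input limit.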
- Goal generation
Goals are generated from the data summary; each one includes an index, question, visualization, and rationale.
# Goal generation: generate visualization goals from the data summary; n=3 means generate 3 goals
goals = lida.goals(summary, n=3, textgen_config=textgen_config)
# Inspect the generated goals
for goal in goals:
    print("=" * 20)
    print(f"Question: {goal.index}")
    # Print the question, visualization, and rationale of each goal
    print(goal.question)
    print(goal.visualization)
    print(goal.rationale)
# Output
====================
Question: 0
What is the distribution of Retail_Price?
histogram of Retail_Price
This tells about the spread of prices of cars in the dataset.
====================
Question: 1
What is the distribution of Engine_Size__l_ among different car types?
box plot of Engine_Size__l_ for each car type
This will help in identifying if there is any difference in engine size among different car types.
====================
Question: 2
What is the relationship between Horsepower_HP_ and City_Miles_Per_Gallon?
scatter plot of Horsepower_HP_ vs City_Miles_Per_Gallon
This will help in identifying if there is any correlation between horsepower and fuel efficiency of cars.
- Generating visualizations
Charts are generated automatically from each goal's visualization suggestion.
library = "matplotlib"  # options: "altair", "seaborn", "plotly", "matplotlib"
textgen_config = TextGenerationConfig(n=1, temperature=0.2, use_cache=True)
for i in range(len(goals)):
    # Print the question, visualization, and rationale of each goal
    print("Question: ", goals[i].question)
    print("Visualization: ", goals[i].visualization)
    print("Rationale: ", goals[i].rationale)
    charts = lida.visualize(summary=summary, goal=goals[i], textgen_config=textgen_config, library=library)
    plot_raster(charts[0].raster)
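The explanation output later in this article describes the generated code as defining a plot() function that takes the data as input. Under that assumption, the sketch below shows one way a "generate, then execute" step could run such a code string and capture errors instead of crashing. It is an illustration rather than LIDA's executor: run_generated_code and GENERATED_CODE are hypothetical, and the placeholder plot() returns plain tuples instead of drawing a chart so that the example needs no plotting library.

```python
import traceback

# Placeholder for LLM-generated code; a real snippet would draw with a plotting library.
GENERATED_CODE = '''
def plot(data):
    return list(zip(data["x"], data["y"]))
'''


def run_generated_code(code, data):
    """Hypothetical helper: execute a code string and call the plot() it defines."""
    namespace = {}
    try:
        exec(code, namespace)  # caution: only run code you trust, ideally in a sandbox
        result = namespace["plot"](data)
        return {"ok": True, "result": result, "error": None}
    except Exception:
        # Keep the traceback so it can be shown to the user or fed into a repair step.
        return {"ok": False, "result": None, "error": traceback.format_exc()}


good = run_generated_code(GENERATED_CODE, {"x": [1, 2], "y": [3, 4]})
bad = run_generated_code("def plot(data):\n    return data['missing']", {"x": [1]})
print(good["ok"], good["result"])
print(bad["ok"])
```

Capturing the traceback rather than raising is what makes an automatic repair loop possible: the error text becomes feedback for the next generation attempt.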
- Chart editing
Edit a chart with natural language instructions, for example its colors, font size, or even typeface. (This feels really handy when writing papers or research reports XD)
# Change the chart color and font size
instructions = ["change the color to red", "scale the word size to 50%"]
edited_charts = lida.edit(code=charts[0].code, summary=summary, instructions=instructions)
plot_raster(edited_charts[0].raster)
- Visualization explanation
code = charts[0].code
explanations = lida.explain(code=code, library=library, textgen_config=textgen_config)
for row in explanations[0]:
    print(row["section"], " ** ", row["explanation"])
# Output
accessibility ** The code creates a scatter plot using the matplotlib.pyplot library to visualize the relationship between two variables - Horsepower_HP_ and City_Miles_Per_Gallon. The plot is colored blue with an alpha value of 0.5 to show the density of the data points. The x-axis is labeled 'Horsepower_HP_' and the y-axis is labeled 'City_Miles_Per_Gallon'. The title of the plot is 'What is the relationship between Horsepower_HP_ and City_Miles_Per_Gallon?'.
transformation ** There is no data transformation happening in this code. The plot is made using the original data as it is.
visualization ** The code first imports the required libraries - matplotlib.pyplot and pandas. The function plot() takes a pandas DataFrame as input and creates a scatter plot using the plt.scatter() method. The x-axis of the plot is the 'Horsepower_HP_' column of the input DataFrame and the y-axis is the 'City_Miles_Per_Gallon' column of the input DataFrame. The alpha parameter controls the transparency of the data points and the color parameter sets the color of the data points. The plt.xlabel() and plt.ylabel() methods add labels to the x-axis and y-axis respectively. The plt.title() method adds a title to the plot. The wrap parameter in plt.title() is set to True to wrap the title text if it exceeds the width of the plot. Finally, the function returns the plot object.
- Visualization evaluation and repair
Evaluate whether a chart has problems. The scoring dimensions are bugs, transformation (data transformation), compliance (with the goal), type (chart type), encoding, and aesthetics. The biggest surprise is that it can even score aesthetics XDD
evaluations = lida.evaluate(code=code, goal=goals[i], library=library)[0]
for evaluation in evaluations:
    print(evaluation["dimension"], "Score", evaluation["score"], "/ 10")
    print("\t", evaluation["rationale"][:200])
    print("\t**********************************")
# Output
bugs Score 10 / 10
No bugs, syntax errors, or typos found.
**********************************
transformation Score 10 / 10
No data transformation needed for a scatter plot.
**********************************
compliance Score 8 / 10
The code meets the specified visualization goal, but the title could be improved by removing the question mark and rephrasing it as a statement.
**********************************
type Score 9 / 10
A scatter plot is an appropriate visualization type for exploring the relationship between two continuous variables.
**********************************
encoding Score 9 / 10
The data is encoded appropriately with Horsepower_HP_ on the x-axis and City_Miles_Per_Gallon on the y-axis.
**********************************
aesthetics Score 9 / 10
The aesthetics of the visualization are appropriate with a blue color and an alpha of 0.5 to show overlapping points.
**********************************
- Chart recommendation
Based on the context in the summary, the LLM recommends the requested number of charts.
textgen_config = TextGenerationConfig(n=1, temperature=0, use_cache=True)
recommended_charts = lida.recommend(code=code, summary=summary, n=3, textgen_config=textgen_config)
print(f"Recommended {len(recommended_charts)} charts")
for chart in recommended_charts:
    plot_raster(chart.raster)
- Custom chart generation
# First import the Goal class from lida.datamodel
from lida.datamodel import Goal
# A Goal has four fields: index, question, visualization, and rationale
custom_goal = Goal(
    index=0,
    question="What is the distribution of the Type?",
    visualization="Bar Chart",
    rationale="The type of the car is an important feature of the dataset."
)
# Generate the chart
custom_chart = lida.visualize(summary=summary, goal=custom_goal, textgen_config=textgen_config, library=library)
plot_raster(custom_chart[0].raster)
# Edit the custom chart
custom_instructions = ["change the color to blue tone on tone color"]  # change the color of the bar chart
edited_custom_charts = lida.edit(code=custom_chart[0].code, summary=summary, instructions=custom_instructions)
plot_raster(edited_custom_charts[0].raster)
Web UI
LIDA also officially provides a Web UI where you can upload your own data for analysis. Usage:
pip install lida
export OPENAI_API_KEY=<your key>
lida ui --port=8080 --docs
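!! Notes:
- Dataset size: LIDA is currently best suited to small datasets, because current LLMs cannot handle very long inputs (token length limits).
- LLM choice: LIDA works best with GPT-3.5 and GPT-4, because summarizing high-dimensional data and performing the reasoning still require fairly large LLMs to get good results.
- Reliability: the paper reports an error rate below 3.5%, but you should still double-check that the output charts are reasonable.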
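For the dataset-size note, a common workaround is to work on a random sample of rows before handing the data to an LLM-based tool. The sketch below shows that idea with the standard library only. It is a general technique, not something the article or LIDA prescribes: downsample_csv is a hypothetical helper, the demo data is synthetic, and whether a sample is representative enough depends on your dataset.

```python
import csv
import io
import random


def downsample_csv(csv_text, max_rows, seed=0):
    """Hypothetical helper: keep the header plus at most max_rows randomly chosen rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    if len(rows) > max_rows:
        rng = random.Random(seed)  # fixed seed so the sample is reproducible
        keep = sorted(rng.sample(range(len(rows)), max_rows))  # preserve original order
        rows = [rows[index] for index in keep]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return output.getvalue()


# Synthetic demo data: 100 rows of id and id squared.
demo_csv = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100))
small_csv = downsample_csv(demo_csv, max_rows=10)
print(small_csv)
```

Fixing the seed keeps the sample stable between runs, which also plays well with response caching.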
Reference:
This article is reposted from the WeChat official account Halo咯咯. Author: 基咯咯
Original link: https://mp.weixin.qq.com/s/smeYr8cUi3yqXYm4jBz7Wg
