Tackling web crawling and data extraction? Crawl4AI gives you an efficient AI Agent workflow

Published on 2024-11-4 11:51

Hey everyone! This is a channel focused on AI agents!

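Today I want to share a great open-source tool: Crawl4AI. It is a real help when building AI Agents, because it automates web crawling and data extraction, so developers can build intelligent Agents that collect and analyze information more efficiently.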

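First, Crawl4AI is fully open source and free, so developers can use it with no barrier to entry. Its core highlight is that it is AI-driven: it can automatically identify and parse web page elements, which saves a lot of time and effort. Crawl4AI can also convert the extracted data into structured formats such as JSON or Markdown, which makes data analysis much simpler.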

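Next, a quick look at how to use Crawl4AI. First install it; a single command is enough. Then create a Python script, initialize the web crawler, and extract data from a URL. Crawl4AI also supports page scrolling, crawling multiple URLs, media tag extraction, metadata extraction, and even screenshots, so the feature set is quite comprehensive.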

from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()
result = crawler.run(url="https://openai.com/api/pricing/")
print(result.markdown)
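
To make "media tag extraction" and "metadata extraction" a bit more concrete, here is a minimal standard-library sketch of the idea. It is not Crawl4AI's implementation or API. The names MediaMetaParser and extract_media_and_metadata, and the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser


class MediaMetaParser(HTMLParser):
    """Hypothetical helper: collects <meta> pairs and media sources from HTML."""

    MEDIA_TAGS = {"img", "video", "audio"}

    def __init__(self):
        super().__init__()
        self.metadata = {}  # e.g. {"description": "..."}
        self.media = []     # e.g. [{"type": "img", "src": "..."}]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Metadata lives in <meta name=...> or <meta property=...> tags.
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.metadata[key] = attrs["content"]
        elif tag in self.MEDIA_TAGS and attrs.get("src"):
            self.media.append({"type": tag, "src": attrs["src"]})


def extract_media_and_metadata(html):
    """Return the metadata and media entries found in an HTML string."""
    parser = MediaMetaParser()
    parser.feed(html)
    return {"metadata": parser.metadata, "media": parser.media}


# Illustrative input only; a real crawl would pass the fetched page HTML.
sample_html = """<html><head><meta name="description" content="Pricing page">
<meta property="og:title" content="Pricing"></head>
<body><img src="/logo.png" alt="logo"><video src="/demo.mp4"></video></body></html>"""

info = extract_media_and_metadata(sample_html)
print(info["metadata"])
print(info["media"])
```

With a crawler in place you would normally rely on the library's own result fields rather than parsing HTML yourself; the sketch only shows what kind of data those features collect.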

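Here is the key part: Crawl4AI can also use a large language model (LLM) to define the extraction strategy and turn the extracted data into a structured format. That means you can customize the extraction rules to your needs and have Crawl4AI pull information from the page according to your instructions.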

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy= LLMExtractionStrategy(
            provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
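            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""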
        ),            
        bypass_cache=True,
    )

print(result.extracted_content)
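
Once the extraction has run, you usually want typed records rather than a raw string. The sketch below assumes that result.extracted_content is a JSON string holding a list of objects shaped like the example in the instruction; that shape is an assumption, not something the script above guarantees. ModelFeeRecord and parse_extracted are hypothetical names, and the sample string stands in for real crawler output.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelFeeRecord:
    """Hypothetical plain-Python mirror of the OpenAIModelFee schema."""
    model_name: str
    input_fee: str
    output_fee: str


def parse_extracted(extracted_content):
    """Parse a JSON string into records, skipping items that miss a field."""
    items = json.loads(extracted_content)
    if isinstance(items, dict):
        # Tolerate a single object instead of a list.
        items = [items]
    records = []
    for item in items:
        try:
            records.append(
                ModelFeeRecord(item["model_name"], item["input_fee"], item["output_fee"])
            )
        except (KeyError, TypeError):
            # Incomplete or malformed item: skip it rather than fail the run.
            continue
    return records


# Stand-in for result.extracted_content (assumed shape).
sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}, {"model_name": "incomplete"}]'
)
records = parse_extracted(sample)
for record in records:
    print(record.model_name, record.input_fee, record.output_fee)
```

Skipping incomplete items is one possible design choice; if a missing fee should be treated as an error in your workflow, raise instead of continuing.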

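Even better, Crawl4AI can be integrated with Praison CrewAI to make data processing more efficient. You create a tools file that wraps the Crawl tool, then configure the AI Agents to use Crawl for web scraping and data extraction.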

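For example, you can set up one AI Agent whose role is web scraping expert, dedicated to scraping model pricing information from the web. Another Agent could be a data cleaning expert that makes sure the collected data is accurate and well formatted. A third Agent is a data analysis expert that focuses on extracting valuable insights from the data.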

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool

class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")

class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts the input and output token fees of models from a given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()

        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy= LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'), 
                schema=ModelFee.schema(),
                extraction_type="schema",
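                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""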
            ),            
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    # Test the ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)

YAML configuration

framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
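    backstory: An expert web scraper with a deep understanding of extracting structured data from online sources. https://openai.com/api/pricing/ https://www.anthropic.com/pricing https://cohere.com/pricing
    goal: Collect model pricing data from various websites
    role: Web Scraper
    tasks:
      scrape_model_pricing:
        description: Scrape model pricing information from the provided list of websites.
        expected_output: Raw HTML or JSON containing the model pricing data.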
    tools:
    - 'ModelFeeTool'
  data_cleaner:
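    backstory: A data cleaning expert who makes sure all collected data is accurate and correctly formatted.
    goal: Clean and organize the scraped pricing data
    role: Data Cleaner
    tasks:
      clean_pricing_data:
        description: Process the raw scraped data, remove any duplicates and inconsistencies, and convert it into a structured format.
        expected_output: A cleaned and organized JSON or CSV file containing the model pricing data.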
    tools:
    - ''
  data_analyzer:
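    backstory: A data analysis expert focused on deriving actionable insights from structured data.
    goal: Analyze the cleaned pricing data to extract insights
    role: Data Analyzer
    tasks:
      analyze_pricing_data:
        description: Analyze the cleaned data and extract trends, patterns, and insights about model pricing.
        expected_output: A detailed report summarizing model pricing trends and insights.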
    tools:
    - ''
dependencies: []
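
In the configuration above the data_cleaner role is described only in prose, and the Agent's LLM does the actual work. As a rough illustration of what "remove duplicates and convert to a structured format" could mean in plain Python, here is a small sketch. The fee format it parses is taken from the example in the extraction instruction ("US$10.00 / 1M tokens"). The names parse_fee and clean_records, the output keys, and the sample rows are hypothetical.

```python
import re

# Matches fees such as "US$10.00 / 1M tokens" or "$0.50 / 1K tokens".
FEE_PATTERN = re.compile(
    r"\$\s*([0-9]+(?:\.[0-9]+)?)\s*/\s*([0-9]+(?:\.[0-9]+)?)?\s*([KMB])?\s*tokens",
    re.IGNORECASE,
)
UNIT = {"": 1.0, "K": 1e3, "M": 1e6, "B": 1e9}


def parse_fee(fee):
    """Return the fee in USD per 1M tokens, or None if the text does not match."""
    match = FEE_PATTERN.search(fee)
    if match is None:
        return None
    price = float(match.group(1))
    count = float(match.group(2)) if match.group(2) else 1.0
    unit = UNIT[(match.group(3) or "").upper()]
    return price * 1_000_000 / (count * unit)


def clean_records(rows):
    """Drop duplicate models (case-insensitive) and normalize the fee fields."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["llm_model_name"].strip()
        key = name.lower()
        if key in seen:
            continue  # keep only the first occurrence of each model
        seen.add(key)
        cleaned.append({
            "model": name,
            "input_usd_per_1m": parse_fee(row["input_fee"]),
            "output_usd_per_1m": parse_fee(row["output_fee"]),
        })
    return cleaned


# Made-up rows in the shape the ModelFee schema describes.
rows = [
    {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"llm_model_name": " gpt-4 ", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"llm_model_name": "small", "input_fee": "US$0.50 / 1K tokens", "output_fee": "n/a"},
]
cleaned = clean_records(rows)
print(cleaned)
```

Normalizing every fee to USD per 1M tokens makes prices from different pages directly comparable, which is what the data_analyzer role would need downstream.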

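In short, Crawl4AI is a powerful tool that lets AI Agents perform web crawling and data extraction with greater efficiency and accuracy. Its open-source nature, AI-driven capabilities, and versatility make it a valuable resource for developers who want to build intelligent, data-driven Agents.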

This article is reposted from 探索AGI. Author: 獼猴桃
