Crawl4AI is the #1 trending repository on GitHub, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI gives developers unmatched speed, precision, and ease of deployment.
🎉 Version 0.6.0 is now available! This release candidate introduces geo-aware crawling, table-to-DataFrame extraction, a pre-warmed browser pool, network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! Read the release notes →
My journey with computers began in childhood, when my father, a computer scientist, introduced me to an Amstrad computer. That early exposure sparked a fascination with technology and led me to specialize in natural language processing during my postgraduate studies. It was then that I first ventured into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my data-extraction skills.
Fast forward to 2023: while building a tool for a project, I needed a crawler that could convert web pages to Markdown. Exploring the options, I found one that claimed to be open source yet required creating an account and generating an API token. Worse, it turned out to be a SaaS service charging $16, and its quality fell short. Out of that frustration I realized this was a deeper problem, and the frustration became motivation: I decided to build the solution myself. Within just a few days I created Crawl4AI. To my surprise it went viral, earning thousands of GitHub stars and resonating with a global community.
I made Crawl4AI open source for two reasons. First, it is my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for democratizing AI, letting everyone train their own models and take ownership of their information. This library is the first step toward building the best open-source data extraction and generation tool there is, created together by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream bigger. Join us: file issues, submit PRs, or spread the word. Together we can build a tool that truly empowers people to access their own data and reshape the future of AI.
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
If you run into browser-related issues, you can install the browsers manually:
python -m playwright install --with-deps chromium
import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown
# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
Responsive image formats such as srcset and picture are supported, and raw HTML strings (raw:) or local files (file://) can be crawled directly. ✨ Visit the documentation website for the full feature list.
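For illustration, here is a minimal sketch of the raw: and file:// inputs mentioned above (the local file path is a placeholder, not a real example from the project):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl an in-memory HTML string using the raw: prefix
        result = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(result.markdown)
        # Crawl a local file using the file:// prefix (placeholder path)
        # result = await crawler.arun(url="file:///path/to/page.html")

if __name__ == "__main__":
    asyncio.run(main())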
Crawl4AI offers flexible installation options to suit different use cases. You can install it as a Python package or use Docker.
Choose the installation option that fits your needs:
For basic web crawling and scraping tasks:
pip install crawl4ai
crawl4ai-setup # Setup the browser
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for crawling.
👉 Note: when you install Crawl4AI, crawl4ai-setup should automatically install and configure Playwright. If you run into any Playwright-related issues, you can install it manually:
Through the command line:
playwright install
If that does not work, try the more specific command:
python -m playwright install chromium
This second method has proven more reliable in some cases.
The synchronous version is deprecated and will be removed in a future release. If you need the synchronous version using Selenium:
pip install crawl4ai[sync]
For contributors who plan to modify the source code:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e . # Basic installation in editable mode
Install optional features:
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
🚀 Now available! Our completely redesigned Docker implementation is more efficient and seamless than ever, with a number of improvements over the previous setup.
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
# Visit the playground at http://localhost:11235/playground
See the Docker deployment guide for the full documentation.
Run a quick test (works with both Docker options):
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Continue polling until the task is complete (status="completed");
# see the polling sketch below.
result = requests.get(f"http://localhost:11235/task/{task_id}")
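The single GET above returns the task's current state; here is a hedged polling sketch, assuming the /task/{task_id} response carries a "status" field as the comment indicates:

import time
import requests

# Submit the job, then poll until it reports completion
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

while True:
    result = requests.get(f"http://localhost:11235/task/{task_id}")
    status = result.json().get("status")
    if status == "completed":
        break
    time.sleep(1)

print(result.json())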
More examples are available in the Docker examples. For advanced configuration, environment variables, and usage examples, see the Docker deployment guide.
You can check the project structure in the directory https://github.com/unclecode/crawl4ai/docs/examples. Some popular examples are shared below.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
        # markdown_generator=DefaultMarkdownGenerator(
        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
        # ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(len(result.markdown.raw_markdown))
        print(len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_config = BrowserConfig(
        headless=False,
        verbose=True
    )

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )

        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that the LiteLLM library supports, for instance: ollama/qwen2
            # provider="ollama/qwen2", api_token="no-token",
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"

        result = await crawler.arun(
            url,
            config=run_config,
            magic=True,
        )

        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
🌎 Geo-aware crawling: set geolocation, language, and timezone to get authentic locale-specific content:
crun_cfg = CrawlerRunConfig(
    url="https://browserleaks.com/geo",          # test page that shows your location
    locale="en-US",                              # Accept-Language & UI locale
    timezone_id="America/Los_Angeles",           # JS Date()/Intl timezone
    geolocation=GeolocationConfig(               # override GPS coordinates
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    )
)
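The config above only sets things up; here is a minimal usage sketch that passes it to arun() like any other CrawlerRunConfig (the top-level GeolocationConfig import path is an assumption and may differ by version):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, GeolocationConfig  # import path assumed

async def main():
    crun_cfg = CrawlerRunConfig(
        locale="en-US",
        timezone_id="America/Los_Angeles",
        geolocation=GeolocationConfig(latitude=34.0522, longitude=-118.2437, accuracy=10.0),
    )
    async with AsyncWebCrawler() as crawler:
        # The page should report the overridden location, locale, and timezone
        result = await crawler.arun(url="https://browserleaks.com/geo", config=crun_cfg)
        print(result.markdown[:500])

if __name__ == "__main__":
    asyncio.run(main())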
📊 Table-to-DataFrame extraction: extract HTML tables directly to CSV or pandas DataFrames:
# Note: this fragment assumes an async context and a browser_config defined earlier
import pandas as pd
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CrawlResult

crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()

try:
    # Set up crawl parameters
    crawl_config = CrawlerRunConfig(
        table_score_threshold=8,  # strict table detection
    )

    # Execute market data extraction
    results: List[CrawlResult] = await crawler.arun(
        url="https://coinmarketcap.com/?page=1", config=crawl_config
    )

    # Process results
    raw_df = pd.DataFrame()
    for result in results:
        if result.success and result.media["tables"]:
            raw_df = pd.DataFrame(
                result.media["tables"][0]["rows"],
                columns=result.media["tables"][0]["headers"],
            )
            break
    print(raw_df.head())

finally:
    await crawler.stop()
🚀 Browser pooling: pages launch from pre-warmed browser instances, reducing latency and memory use.
🕸️ Network and console capture: full traffic logs and MHTML snapshots for debugging:
crawler_config = CrawlerRunConfig(
    capture_network=True,
    capture_console=True,
    mhtml=True
)
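To inspect what was captured, here is a hedged sketch; the result attribute names (network_requests, console_messages, mhtml) are assumptions based on the feature description rather than a confirmed API, so they are read defensively:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    crawler_config = CrawlerRunConfig(capture_network=True, capture_console=True, mhtml=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=crawler_config)
        # Attribute names below are assumptions; getattr avoids errors if they differ
        network_log = getattr(result, "network_requests", None)
        console_log = getattr(result, "console_messages", None)
        mhtml_snapshot = getattr(result, "mhtml", None)
        print(len(network_log or []), "network entries")
        print(len(console_log or []), "console messages")
        if mhtml_snapshot:
            with open("snapshot.mhtml", "w", encoding="utf-8") as f:
                f.write(mhtml_snapshot)

if __name__ == "__main__":
    asyncio.run(main())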
🔌 MCP integration: connect AI tools such as Claude Code through the Model Context Protocol.
# Add Crawl4AI to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
🖥️ Interactive playground: test configurations and generate API requests with the built-in web UI at http://localhost:11235/playground
🐳 Revamped Docker deployment: streamlined multi-architecture Docker image with improved resource efficiency.
📱 Multi-stage build system: optimized Dockerfile with platform-specific performance enhancements.
The crwl CLI provides convenient terminal access, and lxml-based scraping enables high-speed HTML parsing. For the full list of updates, see the 0.5.0 release notes.
Crawl4AI follows standard Python version numbering conventions (PEP 440), so you can tell how stable each release is and what it contains.
Version numbers use the format MAJOR.MINOR.PATCH (for example, 0.4.3).
We use suffixes to indicate the development stage:
- dev (0.4.3dev1): development builds, unstable
- a (0.4.3a1): alpha releases with experimental features
- b (0.4.3b1): beta releases, feature-complete but still being tested
- rc (0.4.3): release candidates, potentially the final version
Install the stable version:
pip install -U crawl4ai
Install a pre-release version:
pip install crawl4ai --pre
Install a specific version:
pip install crawl4ai==0.4.3b1
Pre-release versions help us gather feedback and stabilize new features before the final release.
For production environments we recommend the stable version. To test new features, install a pre-release with the --pre flag.
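To confirm which build is actually installed, a quick check (assuming the package exposes a __version__ attribute, which is common but not guaranteed):

import crawl4ai

# Print the installed version string, e.g. "0.6.0" or a pre-release like "0.4.3b1"
print(getattr(crawl4ai, "__version__", "unknown"))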
🚨 Documentation update coming: next week we will carry out a major documentation overhaul to reflect the latest updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
The current documentation, including installation instructions, advanced features, and the API reference, is available on the documentation website.
For our development plans and upcoming features, check out the roadmap.
We welcome contributions from the open-source community. See the contribution guidelines for details.
This project is licensed under the Apache License 2.0 with an attribution requirement. See the Apache 2.0 license file for details.
When using Crawl4AI, you must include one of the following attribution methods:
Add one of these badges to your README, documentation, or website:
Theme | Badge
---|---
Disco Theme (Animated) | (see HTML below)
Night Theme (Dark with Neon) | (see HTML below)
Dark Theme (Classic) | (see HTML below)
Light Theme (Classic) | (see HTML below)
HTML code for adding the badges:
<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
Add this line to your documentation:
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
If you use Crawl4AI in your research or project, please cite:
@software{crawl4ai2024,
  author = {UncleCode},
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
  commit = {Please use the commit hash you're working with}
}
Text citation format:
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
GitHub. https://github.com/unclecode/crawl4ai
For questions, suggestions, or feedback, feel free to reach out:
Happy crawling! 🕸️🚀
Our mission is to unlock the value of data by transforming personal and enterprise data into structured, tradeable assets. Crawl4AI provides open-source tools that empower individuals and organizations to extract and structure data, fostering a shared data economy.
We envision an AI future powered by authentic human knowledge, where data creators benefit directly from their contributions. By democratizing data and enabling ethical sharing, we lay the groundwork for genuine AI advancement.