Crawl4AI is the #1 trending repository on GitHub, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI gives developers unmatched speed, precision, and ease of deployment.
🎉 Version 0.6.0 is now available! This release candidate introduces geo-aware crawling, table-to-DataFrame extraction, a pre-warmed browser pool, network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! Read the release notes →
My journey with computers began in childhood, when my father, a computer scientist, introduced me to an Amstrad computer. That early exposure sparked a fascination with technology and led me to specialize in natural language processing during my postgraduate studies. It was then that I first ventured into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my data-extraction skills.
Fast forward to 2023: while building a tool for a project, I needed a crawler that could convert web pages to Markdown. Exploring the options, I found one that claimed to be open source yet required creating an account and generating an API token. Worse, it turned out to be a SaaS service charging $16, and its quality fell short. Out of that frustration I realized this was a deeper problem, and the frustration became motivation: I decided to build the solution myself. Within just a few days I created Crawl4AI. To my surprise it went viral, earning thousands of GitHub stars and resonating with a global community.
I made Crawl4AI open source for two reasons. First, it is my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for democratizing AI, letting everyone train their own models and take ownership of their information. This library is the first step toward building the best open-source data extraction and generation tool there is, created together by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream bigger. Join us: file issues, submit PRs, or spread the word. Together we can build a tool that truly empowers people to access their own data and reshape the future of AI.
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
If you run into browser-related issues, you can install the browsers manually:
python -m playwright install --with-deps chromium
import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown
# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
Responsive image formats such as srcset and picture are supported, and raw HTML strings (raw:) or local files (file://) can be crawled directly. ✨ Visit the documentation website for the full feature list.
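For illustration, here is a minimal sketch of the raw: and file:// inputs mentioned above (the local file path is a placeholder, not a real example from the project):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl an in-memory HTML string using the raw: prefix
        result = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(result.markdown)
        # Crawl a local file using the file:// prefix (placeholder path)
        # result = await crawler.arun(url="file:///path/to/page.html")

if __name__ == "__main__":
    asyncio.run(main())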
Crawl4AI offers flexible installation options to suit different use cases. You can install it as a Python package or use Docker.
Choose the installation option that fits your needs:
For basic web crawling and scraping tasks:
pip install crawl4ai
crawl4ai-setup # Setup the browser
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for crawling.
👉 Note: when you install Crawl4AI, crawl4ai-setup should automatically install and configure Playwright. If you run into any Playwright-related issues, you can install it manually:
Through the command line:
playwright install
If that does not work, try the more specific command:
python -m playwright install chromium
This second method has proven more reliable in some cases.
The synchronous version is deprecated and will be removed in a future release. If you need the synchronous version using Selenium:
pip install crawl4ai[sync]
For contributors who plan to modify the source code:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e . # Basic installation in editable mode
Install optional features:
pip install -e ".[torch]" # With PyTorch features
pip install -e ".[transformer]" # With Transformer features
pip install -e ".[cosine]" # With cosine similarity features
pip install -e ".[sync]" # With synchronous crawling (Selenium)
pip install -e ".[all]" # Install all optional features
🚀 Now available! Our completely redesigned Docker implementation is more efficient and seamless than ever, with a number of improvements over the previous setup.
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
# Visit the playground at http://localhost:11235/playground
See the Docker deployment guide for the full documentation.
Run a quick test (works with both Docker options):
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Continue polling until the task is complete (status="completed");
# see the polling sketch below.
result = requests.get(f"http://localhost:11235/task/{task_id}")
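The single GET above returns the task's current state; here is a hedged polling sketch, assuming the /task/{task_id} response carries a "status" field as the comment indicates:

import time
import requests

# Submit the job, then poll until it reports completion
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

while True:
    result = requests.get(f"http://localhost:11235/task/{task_id}")
    status = result.json().get("status")
    if status == "completed":
        break
    time.sleep(1)

print(result.json())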
More examples are available in the Docker examples. For advanced configuration, environment variables, and usage examples, see the Docker deployment guide.
You can check the project structure in the directory https://github.com/unclecode/crawl4ai/docs/examples. Some popular examples are shared below.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
        # markdown_generator=DefaultMarkdownGenerator(
        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
        # ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(len(result.markdown.raw_markdown))
        print(len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_config = BrowserConfig(
        headless=False,
        verbose=True
    )

    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )

        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that the LiteLLM library supports, for instance: ollama/qwen2
            # provider="ollama/qwen2", api_token="no-token",
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"

        result = await crawler.arun(
            url,
            config=run_config,
            magic=True,
        )

        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
🌎 Geo-aware crawling: set geolocation, language, and timezone to get authentic locale-specific content:
crun_cfg = CrawlerRunConfig(
    url="https://browserleaks.com/geo",          # test page that shows your location
    locale="en-US",                              # Accept-Language & UI locale
    timezone_id="America/Los_Angeles",           # JS Date()/Intl timezone
    geolocation=GeolocationConfig(               # override GPS coordinates
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    )
)
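The config above only sets things up; here is a minimal usage sketch that passes it to arun() like any other CrawlerRunConfig (the top-level GeolocationConfig import path is an assumption and may differ by version):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, GeolocationConfig  # import path assumed

async def main():
    crun_cfg = CrawlerRunConfig(
        locale="en-US",
        timezone_id="America/Los_Angeles",
        geolocation=GeolocationConfig(latitude=34.0522, longitude=-118.2437, accuracy=10.0),
    )
    async with AsyncWebCrawler() as crawler:
        # The page should report the overridden location, locale, and timezone
        result = await crawler.arun(url="https://browserleaks.com/geo", config=crun_cfg)
        print(result.markdown[:500])

if __name__ == "__main__":
    asyncio.run(main())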
📊 Table-to-DataFrame extraction: extract HTML tables directly to CSV or pandas DataFrames:
# Note: this fragment assumes an async context and a browser_config defined earlier
import pandas as pd
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CrawlResult

crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()

try:
    # Set up crawl parameters
    crawl_config = CrawlerRunConfig(
        table_score_threshold=8,  # strict table detection
    )

    # Execute market data extraction
    results: List[CrawlResult] = await crawler.arun(
        url="https://coinmarketcap.com/?page=1", config=crawl_config
    )

    # Process results
    raw_df = pd.DataFrame()
    for result in results:
        if result.success and result.media["tables"]:
            raw_df = pd.DataFrame(
                result.media["tables"][0]["rows"],
                columns=result.media["tables"][0]["headers"],
            )
            break
    print(raw_df.head())

finally:
    await crawler.stop()
🚀 Browser pooling: pages launch from pre-warmed browser instances, reducing latency and memory use.
🕸️ Network and console capture: full traffic logs and MHTML snapshots for debugging:
crawler_config = CrawlerRunConfig(
    capture_network=True,
    capture_console=True,
    mhtml=True
)
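To inspect what was captured, here is a hedged sketch; the result attribute names (network_requests, console_messages, mhtml) are assumptions based on the feature description rather than a confirmed API, so they are read defensively:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    crawler_config = CrawlerRunConfig(capture_network=True, capture_console=True, mhtml=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=crawler_config)
        # Attribute names below are assumptions; getattr avoids errors if they differ
        network_log = getattr(result, "network_requests", None)
        console_log = getattr(result, "console_messages", None)
        mhtml_snapshot = getattr(result, "mhtml", None)
        print(len(network_log or []), "network entries")
        print(len(console_log or []), "console messages")
        if mhtml_snapshot:
            with open("snapshot.mhtml", "w", encoding="utf-8") as f:
                f.write(mhtml_snapshot)

if __name__ == "__main__":
    asyncio.run(main())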
🔌 MCP integration: connect AI tools such as Claude Code through the Model Context Protocol.
# Add Crawl4AI to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
🖥️ Interactive playground: test configurations and generate API requests with the built-in web UI at http://localhost:11235/playground
🐳 Revamped Docker deployment: streamlined multi-architecture Docker image with improved resource efficiency.
📱 Multi-stage build system: optimized Dockerfile with platform-specific performance enhancements.
The crwl CLI provides convenient terminal access, and lxml-based scraping enables high-speed HTML parsing. For the full list of updates, see the 0.5.0 release notes.
Crawl4AI follows standard Python version numbering conventions (PEP 440), so you can tell how stable each release is and what it contains.
Version numbers use the format MAJOR.MINOR.PATCH (for example, 0.4.3).
We use suffixes to indicate the development stage:
- dev (0.4.3dev1): development builds, unstable
- a (0.4.3a1): alpha releases with experimental features
- b (0.4.3b1): beta releases, feature-complete but still being tested
- rc (0.4.3): release candidates, potentially the final version
Install the stable version:
pip install -U crawl4ai
Install a pre-release version:
pip install crawl4ai --pre
Install a specific version:
pip install crawl4ai==0.4.3b1
Pre-release versions help us gather feedback and stabilize new features before the final release.
For production environments we recommend the stable version. To test new features, install a pre-release with the --pre flag.
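To confirm which build is actually installed, a quick check (assuming the package exposes a __version__ attribute, which is common but not guaranteed):

import crawl4ai

# Print the installed version string, e.g. "0.6.0" or a pre-release like "0.4.3b1"
print(getattr(crawl4ai, "__version__", "unknown"))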
🚨 Documentation update coming: next week we will carry out a major documentation overhaul to reflect the latest updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
The current documentation, including installation instructions, advanced features, and the API reference, is available on the documentation website.
For our development plans and upcoming features, check out the roadmap.
We welcome contributions from the open-source community. See the contribution guidelines for details.
This project is licensed under the Apache License 2.0 with an attribution requirement. See the Apache 2.0 license file for details.
When using Crawl4AI, you must include one of the following attribution methods:
Add one of these badges to your README, documentation, or website:
Theme | Badge
---|---
Disco Theme (Animated) | (see HTML below)
Night Theme (Dark with Neon) | (see HTML below)
Dark Theme (Classic) | (see HTML below)
Light Theme (Classic) | (see HTML below)
HTML code for adding the badges:
<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
Add this line to your documentation:
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
If you use Crawl4AI in your research or project, please cite:
@software{crawl4ai2024,
  author = {UncleCode},
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
  commit = {Please use the commit hash you're working with}
}
Text citation format:
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
GitHub. https://github.com/unclecode/crawl4ai
For questions, suggestions, or feedback, feel free to reach out:
Happy crawling! 🕸️🚀
Our mission is to unlock the value of data by transforming personal and enterprise data into structured, tradeable assets. Crawl4AI provides open-source tools that empower individuals and organizations to extract and structure data, fostering a shared data economy.
We envision an AI future powered by authentic human knowledge, where data creators benefit directly from their contributions. By democratizing data and enabling ethical sharing, we lay the groundwork for genuine AI advancement.