Hi everyone, I'm Lao Zhang, an AI learner.
Today I'd like to recommend a project from the LLM ecosystem.
1. Project Overview
Crawl4AI is an open-source web crawler and data-scraping tool designed specifically for large language models (LLMs) and AI applications. It not only collects web data efficiently, but also outputs clean, structured Markdown directly, making it a great fit for RAG (retrieval-augmented generation), AI fine-tuning, knowledge-base construction, and similar scenarios.
2. Core Highlights
- **Optimized for LLMs**: produces smart, concise Markdown that is easy for downstream AI pipelines to consume.
- **Fast and efficient**: real-time crawling, claimed to be up to 6x faster, balancing performance and cost.
- **Flexible browser control**: session management, proxies, and custom hooks make it easier to handle anti-bot measures and complex pages.
- **Heuristic extraction**: built-in algorithms reduce reliance on large models and improve extraction efficiency.
- **Open source and easy to deploy**: no API key required; supports Docker and cloud deployment.
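To make the browser-control bullet above concrete, here is a minimal sketch combining a proxy, headless mode, and session reuse. The proxy URL and session id are placeholders, and exact parameter names may vary between Crawl4AI versions, so treat this as an illustration rather than a definitive recipe:

```python
# Sketch: flexible browser control in Crawl4AI (proxy, headless, sessions).
# The proxy URL and session_id below are placeholders, not a real setup.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(
        headless=True,  # run without a visible browser window
        proxy="http://user:pass@proxy.example.com:8080",  # placeholder proxy
    )
    run_config = CrawlerRunConfig(
        session_id="my-session",  # reuse one browser session across calls
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

Running this requires a working Playwright browser environment (see the installation steps below).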
3. Installation
```bash
pip install crawl4ai
crawl4ai-setup  # one-step browser environment setup
```
If you run into browser-related issues, install Playwright manually:
```bash
python -m playwright install --with-deps chromium
```
Python quick start
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Command-line usage
```bash
# Basic crawl with Markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with a BFS strategy, up to 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Use an LLM to extract content based on a question
crwl https://www.example.com/products -q "Extract all product prices"
```
4. Typical Use Cases
- Building AI knowledge bases, FAQs, and internal enterprise search
- Automated collection of news, forum posts, and product listings
- Custom extraction strategies for all kinds of structured and semi-structured data
- Combining with LLMs for intelligent Q&A and information extraction
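As one way to act on the knowledge-base use case above, crawled Markdown is typically split into overlapping chunks before embedding into a RAG store. The `chunk_markdown` helper below is purely illustrative and not part of Crawl4AI; it just shows a simple post-processing step for `result.markdown`:

```python
# Illustrative only: a simple overlapping character chunker for feeding
# crawled Markdown into a RAG knowledge base. Not part of Crawl4AI.
def chunk_markdown(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

# Example: pretend this string came from crawler.arun(...).markdown
doc = "# Business News\n\n" + "Markets rallied today. " * 40
pieces = chunk_markdown(doc, chunk_size=200, overlap=20)
print(len(pieces), "chunks; first starts with:", pieces[0][:20])
```

In practice you might chunk on Markdown headings or sentence boundaries instead of raw character counts, but the overlap idea carries over.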
5. Advanced Usage Examples
Custom content filtering and Markdown generation
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48, threshold_type="fixed", min_word_threshold=0
            )
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config,
        )
        print(result.markdown.raw_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Custom schema-based structured extraction
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Course Information",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    browser_config = BrowserConfig(headless=False, verbose=True)
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy, cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config,
        )
        courses = json.loads(result.extracted_content)
        print(json.dumps(courses, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
This took some effort to put together. If you found it useful, please consider following me, and giving me a like, a share, and a "looking". Thanks for reading, and see you in the next post!