A robust web scraping system for collecting book data from books.toscrape.com using AI-powered extraction with crawl4ai and Gemini LLM. This project demonstrates advanced web crawling techniques with structured data extraction for RAG (Retrieval-Augmented Generation) applications.
The scraper extracts the following book information for each title: detail URL, image URL, star rating, title, price, and stock availability.
```
books-to-scrape-dataset/
├── main.py                # Main crawling orchestration
├── config.py              # Configuration constants
├── models/
│   └── Book.py            # Pydantic model for book data
├── utils/
│   ├── data_utils.py      # Data processing and file I/O
│   └── scraper_utils.py   # Web scraping utilities
├── data/
│   ├── books_data.json    # Structured JSON output
│   └── complete_books.csv # CSV format output
└── pyproject.toml         # Project dependencies
```
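The contents of `models/Book.py` are not shown here; a minimal sketch consistent with the fields in `REQUIRED_KEYS` and the sample JSON output could look like the following (the actual file may differ):

```python
from pydantic import BaseModel


class Book(BaseModel):
    # All fields are extracted as strings, matching the sample JSON output
    # (e.g. rating "Four", price "51.77", in_stock_availability "True").
    detail_url: str
    image_url: str
    rating: str
    title: str
    price: str
    in_stock_availability: str
```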
```bash
git clone <repository-url>
cd uv-rag-crawl-v3

# Using UV (recommended)
uv sync

# Or using pip
pip install -r requirements.txt

# Create .env file
echo "GEMINI_API_KEY=your_gemini_api_key_here" > .env
```
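How the key is read at runtime is up to `main.py` (the project may well use `python-dotenv`); a minimal stdlib-only sketch, where the helper name `load_gemini_key` is purely illustrative:

```python
import os


def load_gemini_key(env=None) -> str:
    """Return GEMINI_API_KEY from the environment, failing fast if it is missing."""
    source = os.environ if env is None else env
    key = source.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    return key
```

Failing fast here gives a clear error before any crawling starts, rather than a cryptic authentication failure mid-run.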
- `GEMINI_API_KEY`: Your Gemini API key for AI-powered extraction

Key settings in `config.py`:

```python
BASE_URL = "https://books.toscrape.com/catalogue/"
CSS_SELECTOR = "[class^='product_pod']"
REQUIRED_KEYS = [
    "detail_url",
    "image_url",
    "rating",
    "title",
    "price",
    "in_stock_availability",
]
```
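One plausible use of `REQUIRED_KEYS` is filtering out incomplete records before saving; a sketch, where the function name `is_complete_book` is illustrative rather than taken from the project:

```python
REQUIRED_KEYS = [
    "detail_url",
    "image_url",
    "rating",
    "title",
    "price",
    "in_stock_availability",
]


def is_complete_book(record: dict) -> bool:
    """True when every required field is present and non-empty."""
    return all(record.get(key) for key in REQUIRED_KEYS)
```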
```bash
python main.py
```
The scraper produces two output files:

- `data/books_data.json`: Structured JSON with metadata
- `data/complete_books.csv`: CSV format for spreadsheet analysis

Sample JSON output:

```json
{
  "metadata": {
    "total_books": 993,
    "scraped_at": "2025-08-09 02:03:08.337856",
    "source": "books.toscrape.com",
    "description": "Book data scraped for RAG implementation"
  },
  "books": [
    {
      "detail_url": "https://books.toscrape.com/catalogue/book-title/index.html",
      "image_url": "https://books.toscrape.com/media/cache/image.jpg",
      "rating": "Four",
      "title": "Book Title",
      "price": "51.77",
      "in_stock_availability": "True"
    }
  ]
}
```
Sample CSV row:

| detail_url | image_url | rating | title | price | in_stock_availability |
|------------|-----------|--------|-------|-------|-----------------------|
| https://… | https://… | Four | Book Title | 51.77 | True |
The project uses crawl4ai with Gemini 2.0 Flash for intelligent data extraction:
To contribute:

```bash
git checkout -b feature/amazing-feature
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
```
This project is licensed under the MIT License - see the LICENSE file for details.
This project is for educational and research purposes. The scraped data comes from a demo website designed for web scraping practice. Always respect website terms of service and robots.txt when scraping real websites.
Troubleshooting:

- Ensure `GEMINI_API_KEY` is set in `.env`
- Adjust settings in `main.py` if needed

Enable verbose logging by modifying `get_browser_config()` in `utils/scraper_utils.py`:

```python
from crawl4ai import BrowserConfig


def get_browser_config() -> BrowserConfig:
    return BrowserConfig(browser_type="chromium", headless=False, verbose=True)
```
Note: This project demonstrates advanced web scraping techniques with AI-powered data extraction. Use responsibly and in accordance with website terms of service.