📦 Books Scraper — Dataset for RAG & AI Projects

A robust web scraping system for collecting book data from books.toscrape.com, using AI-powered extraction with crawl4ai and the Gemini LLM. This project demonstrates advanced web crawling techniques with structured data extraction for RAG (Retrieval-Augmented Generation) applications.

🚀 Features

  - 🤖 AI-powered extraction: Gemini LLM turns raw catalogue pages into structured records via crawl4ai
  - 📄 Pagination-aware crawling: iterates through every book catalogue page
  - ✅ Validation and deduplication: Pydantic models and required-field checks keep the dataset clean
  - 💾 Dual output: results saved as both JSON (with run metadata) and CSV

📊 Data Collected

The scraper extracts the following book information:

  - detail_url: link to the book's detail page
  - image_url: URL of the cover image
  - rating: star rating as text (e.g., "Four")
  - title: the book's title
  - price: listed price (e.g., "51.77")
  - in_stock_availability: whether the book is in stock ("True"/"False")

📁 Project Structure

books-to-scrape-dataset/
├── main.py                 # Main crawling orchestration
├── config.py              # Configuration constants
├── models/
│   └── Book.py           # Pydantic model for book data
├── utils/
│   ├── data_utils.py     # Data processing and file I/O
│   └── scraper_utils.py  # Web scraping utilities
├── data/
│   ├── books_data.json   # Structured JSON output
│   └── complete_books.csv # CSV format output
└── pyproject.toml        # Project dependencies

🛠️ Installation

Prerequisites

  - Python 3 (see pyproject.toml for the exact version requirement)
  - uv (recommended) or pip
  - A Google Gemini API key
  - A Chromium browser (crawl4ai drives it via Playwright)

Setup

  1. Clone the repository
    git clone <repository-url>
    cd books-to-scrape-dataset
    
  2. Install dependencies
    # Using UV (recommended)
    uv sync
       
    # Or using pip
    pip install -r requirements.txt
    
  3. Environment Setup
    # Create .env file
    echo "GEMINI_API_KEY=your_gemini_api_key_here" > .env
    
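The key is read from the environment at startup. A minimal sketch of how config.py might load it with python-dotenv (the actual loading code may differ):

# Sketch: load the Gemini key from .env (assumes python-dotenv is installed)
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    raise RuntimeError("GEMINI_API_KEY is not set - create a .env file first")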

🔧 Configuration

Environment Variables

| Variable | Description |
|----------|-------------|
| GEMINI_API_KEY | API key for the Gemini LLM used during extraction |

Configuration File (config.py)

BASE_URL = "https://books.toscrape.com/catalogue/"
CSS_SELECTOR = "[class^='product_pod']"
REQUIRED_KEYS = [
    "detail_url",
    "image_url", 
    "rating",
    "title",
    "price",
    "in_stock_availability",
]
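
The extraction schema itself lives in models/Book.py. A plausible sketch of that Pydantic model, with field names mirroring REQUIRED_KEYS (the field types are assumptions):

# models/Book.py (sketch) - fields mirror REQUIRED_KEYS in config.py
from pydantic import BaseModel

class Book(BaseModel):
    detail_url: str             # link to the book's detail page
    image_url: str              # cover image URL
    rating: str                 # star rating as text, e.g. "Four"
    title: str                  # book title
    price: str                  # price as scraped, e.g. "51.77"
    in_stock_availability: str  # "True"/"False" as scraped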

🚀 Usage

Basic Usage

python main.py

What Happens

  1. Browser Initialization: Starts a Chromium browser instance
  2. Page Crawling: Iteratively crawls through book catalog pages
  3. AI Extraction: Uses Gemini LLM to extract structured book data
  4. Data Processing: Validates and deduplicates book entries
  5. File Output: Saves data to both JSON and CSV formats
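
Condensed, the loop in main.py looks roughly like the sketch below. The pagination URL pattern matches books.toscrape.com; get_llm_strategy is a hypothetical helper name, and the exact crawl4ai parameters vary by release:

# main.py (condensed sketch) - assumes a recent crawl4ai release
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

from config import BASE_URL, CSS_SELECTOR
from utils.scraper_utils import get_browser_config, get_llm_strategy  # get_llm_strategy is hypothetical

async def crawl_books() -> list[dict]:
    books, page = [], 1
    async with AsyncWebCrawler(config=get_browser_config()) as crawler:
        while True:
            url = f"{BASE_URL}page-{page}.html"  # catalogue pagination
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(
                    css_selector=CSS_SELECTOR,               # only product_pod cards
                    extraction_strategy=get_llm_strategy(),  # Gemini-backed schema extraction
                    cache_mode=CacheMode.BYPASS,             # always fetch fresh pages
                ),
            )
            if not result.success:  # 404 past the last catalogue page
                break
            books.extend(json.loads(result.extracted_content))
            page += 1
            await asyncio.sleep(2)  # polite delay between pages
    return books

if __name__ == "__main__":
    asyncio.run(crawl_books())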

Output Files

  - data/books_data.json: structured JSON with run metadata and the full book list
  - data/complete_books.csv: the same records flattened to CSV

📊 Data Format

JSON Structure

{
  "metadata": {
    "total_books": 993,
    "scraped_at": "2025-08-09 02:03:08.337856",
    "source": "books.toscrape.com",
    "description": "Book data scraped for RAG implementation"
  },
  "books": [
    {
      "detail_url": "https://books.toscrape.com/catalogue/book-title/index.html",
      "image_url": "https://books.toscrape.com/media/cache/image.jpg",
      "rating": "Four",
      "title": "Book Title",
      "price": "51.77",
      "in_stock_availability": "True"
    }
  ]
}
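
For downstream RAG pipelines, the JSON output can be consumed directly:

# Load the scraped dataset and inspect it
import json

with open("data/books_data.json", encoding="utf-8") as f:
    dataset = json.load(f)

print(dataset["metadata"]["total_books"])  # e.g. 993
print(dataset["books"][0]["title"])        # first scraped title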

CSV Structure

| detail_url | image_url | rating | title | price | in_stock_availability |
|------------|-----------|--------|-------|-------|-----------------------|
| https://… | https://… | Four | Book Title | 51.77 | True |

🔍 Technical Details

AI Extraction Strategy

The project uses crawl4ai with Gemini 2.0 Flash for intelligent data extraction.
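
A plausible wiring of the strategy in utils/scraper_utils.py is sketched below; the provider string, the LLMConfig usage, and the get_llm_strategy name are assumptions, as crawl4ai's API differs between releases:

# utils/scraper_utils.py (sketch) - Gemini-backed schema extraction
import os

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

from models.Book import Book

def get_llm_strategy() -> LLMExtractionStrategy:
    return LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="gemini/gemini-2.0-flash",  # LiteLLM-style provider name
            api_token=os.getenv("GEMINI_API_KEY"),
        ),
        schema=Book.model_json_schema(),  # target structure from the Pydantic model
        extraction_type="schema",
        instruction="Extract every book card on the page into the given schema.",
    )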

Rate Limiting

A short pause between page requests keeps the crawler polite and within Gemini API quotas; the delay lives in main.py and can be increased if you hit rate limits (see Troubleshooting).

Data Validation

Each extracted record is checked against the Book Pydantic model, entries missing any field in REQUIRED_KEYS are dropped, and duplicates are removed before the results are written out. A minimal sketch of these checks follows.
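
A sketch of what utils/data_utils.py might do (the function names and the dedup key are assumptions):

# utils/data_utils.py (sketch) - drop incomplete records, then dedupe
from config import REQUIRED_KEYS

def is_complete_book(book: dict) -> bool:
    # every REQUIRED_KEYS field must be present and non-empty
    return all(book.get(key) for key in REQUIRED_KEYS)

def dedupe_books(books: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for book in books:
        if book["detail_url"] not in seen:  # detail_url assumed unique per book
            seen.add(book["detail_url"])
            unique.append(book)
    return unique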

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  - books.toscrape.com for providing a demo site built for scraping practice
  - crawl4ai for the AI-assisted crawling framework
  - Google Gemini for powering the structured extraction

⚠️ Disclaimer

This project is for educational and research purposes. The scraped data comes from a demo website designed for web scraping practice. Always respect website terms of service and robots.txt when scraping real websites.

📈 Performance

A full run covers all 50 catalogue pages (roughly 1,000 books); the sample dataset above contains 993 validated records. Runtime is dominated by the per-page LLM extraction calls and the polite delay between requests.

🔧 Troubleshooting

Common Issues

  1. API Key Error: Ensure GEMINI_API_KEY is set in .env
  2. Browser Issues: Check if Chromium is installed
  3. Rate Limiting: Increase delays in main.py if needed

Debug Mode

Enable verbose logging by modifying get_browser_config() in utils/scraper_utils.py:

def get_browser_config() -> BrowserConfig:
    return BrowserConfig(browser_type="chromium", headless=False, verbose=True)

Note: This project demonstrates advanced web scraping techniques with AI-powered data extraction. Use responsibly and in accordance with website terms of service.