📦 Books Scraper — Dataset for RAG & AI Projects

A robust web scraping system for collecting book data from books.toscrape.com, using AI-powered extraction with crawl4ai and the Gemini LLM. This project demonstrates advanced web crawling techniques with structured data extraction for RAG (Retrieval-Augmented Generation) applications.

🚀 Features

  - 🤖 AI-powered extraction: Gemini LLM turns raw catalogue pages into structured records via crawl4ai
  - 📄 Pagination-aware crawling: iterates through every book catalogue page
  - ✅ Validation and deduplication: Pydantic models and required-field checks keep the dataset clean
  - 💾 Dual output: results saved as both JSON (with run metadata) and CSV

📊 Data Collected

The scraper extracts the following book information:

  - detail_url: link to the book's detail page
  - image_url: URL of the cover image
  - rating: star rating as text (e.g., "Four")
  - title: the book's title
  - price: listed price (e.g., "51.77")
  - in_stock_availability: whether the book is in stock ("True"/"False")

📁 Project Structure

books-to-scrape-dataset/
├── main.py                 # Main crawling orchestration
├── config.py              # Configuration constants
├── models/
│   └── Book.py           # Pydantic model for book data
├── utils/
│   ├── data_utils.py     # Data processing and file I/O
│   └── scraper_utils.py  # Web scraping utilities
├── data/
│   ├── books_data.json   # Structured JSON output
│   └── complete_books.csv # CSV format output
└── pyproject.toml        # Project dependencies

🛠️ Installation

Prerequisites

  - Python 3 (see pyproject.toml for the exact version requirement)
  - uv (recommended) or pip
  - A Google Gemini API key
  - A Chromium browser (crawl4ai drives it via Playwright)

Setup

  1. Clone the repository
    git clone <repository-url>
    cd books-to-scrape-dataset
    
  2. Install dependencies
    # Using UV (recommended)
    uv sync
       
    # Or using pip
    pip install -r requirements.txt
    
  3. Environment Setup
    # Create .env file
    echo "GEMINI_API_KEY=your_gemini_api_key_here" > .env
    
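The key is read from the environment at startup. A minimal sketch of how config.py might load it with python-dotenv (the actual loading code may differ):

# Sketch: load the Gemini key from .env (assumes python-dotenv is installed)
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    raise RuntimeError("GEMINI_API_KEY is not set - create a .env file first")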

🔧 Configuration

Environment Variables

| Variable | Description |
|----------|-------------|
| GEMINI_API_KEY | API key for the Gemini LLM used during extraction |

Configuration File (config.py)

BASE_URL = "https://books.toscrape.com/catalogue/"
CSS_SELECTOR = "[class^='product_pod']"
REQUIRED_KEYS = [
    "detail_url",
    "image_url", 
    "rating",
    "title",
    "price",
    "in_stock_availability",
]
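
The extraction schema itself lives in models/Book.py. A plausible sketch of that Pydantic model, with field names mirroring REQUIRED_KEYS (the field types are assumptions):

# models/Book.py (sketch) - fields mirror REQUIRED_KEYS in config.py
from pydantic import BaseModel

class Book(BaseModel):
    detail_url: str             # link to the book's detail page
    image_url: str              # cover image URL
    rating: str                 # star rating as text, e.g. "Four"
    title: str                  # book title
    price: str                  # price as scraped, e.g. "51.77"
    in_stock_availability: str  # "True"/"False" as scraped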

🚀 Usage

Basic Usage

python main.py

What Happens

  1. Browser Initialization: Starts a Chromium browser instance
  2. Page Crawling: Iteratively crawls through book catalog pages
  3. AI Extraction: Uses Gemini LLM to extract structured book data
  4. Data Processing: Validates and deduplicates book entries
  5. File Output: Saves data to both JSON and CSV formats
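
Condensed, the loop in main.py looks roughly like the sketch below. The pagination URL pattern matches books.toscrape.com; get_llm_strategy is a hypothetical helper name, and the exact crawl4ai parameters vary by release:

# main.py (condensed sketch) - assumes a recent crawl4ai release
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

from config import BASE_URL, CSS_SELECTOR
from utils.scraper_utils import get_browser_config, get_llm_strategy  # get_llm_strategy is hypothetical

async def crawl_books() -> list[dict]:
    books, page = [], 1
    async with AsyncWebCrawler(config=get_browser_config()) as crawler:
        while True:
            url = f"{BASE_URL}page-{page}.html"  # catalogue pagination
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(
                    css_selector=CSS_SELECTOR,               # only product_pod cards
                    extraction_strategy=get_llm_strategy(),  # Gemini-backed schema extraction
                    cache_mode=CacheMode.BYPASS,             # always fetch fresh pages
                ),
            )
            if not result.success:  # 404 past the last catalogue page
                break
            books.extend(json.loads(result.extracted_content))
            page += 1
            await asyncio.sleep(2)  # polite delay between pages
    return books

if __name__ == "__main__":
    asyncio.run(crawl_books())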

Output Files

  - data/books_data.json: structured JSON with run metadata and the full book list
  - data/complete_books.csv: the same records flattened to CSV

📊 Data Format

JSON Structure

{
  "metadata": {
    "total_books": 993,
    "scraped_at": "2025-08-09 02:03:08.337856",
    "source": "books.toscrape.com",
    "description": "Book data scraped for RAG implementation"
  },
  "books": [
    {
      "detail_url": "https://books.toscrape.com/catalogue/book-title/index.html",
      "image_url": "https://books.toscrape.com/media/cache/image.jpg",
      "rating": "Four",
      "title": "Book Title",
      "price": "51.77",
      "in_stock_availability": "True"
    }
  ]
}
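
For downstream RAG pipelines, the JSON output can be consumed directly:

# Load the scraped dataset and inspect it
import json

with open("data/books_data.json", encoding="utf-8") as f:
    dataset = json.load(f)

print(dataset["metadata"]["total_books"])  # e.g. 993
print(dataset["books"][0]["title"])        # first scraped title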

CSV Structure

| detail_url | image_url | rating | title | price | in_stock_availability |
|------------|-----------|--------|-------|-------|-----------------------|
| https://… | https://… | Four | Book Title | 51.77 | True |

🔍 Technical Details

AI Extraction Strategy

The project uses crawl4ai with Gemini 2.0 Flash for intelligent data extraction.
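
A plausible wiring of the strategy in utils/scraper_utils.py is sketched below; the provider string, the LLMConfig usage, and the get_llm_strategy name are assumptions, as crawl4ai's API differs between releases:

# utils/scraper_utils.py (sketch) - Gemini-backed schema extraction
import os

from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

from models.Book import Book

def get_llm_strategy() -> LLMExtractionStrategy:
    return LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="gemini/gemini-2.0-flash",  # LiteLLM-style provider name
            api_token=os.getenv("GEMINI_API_KEY"),
        ),
        schema=Book.model_json_schema(),  # target structure from the Pydantic model
        extraction_type="schema",
        instruction="Extract every book card on the page into the given schema.",
    )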

Rate Limiting

A short pause between page requests keeps the crawler polite and within Gemini API quotas; the delay lives in main.py and can be increased if you hit rate limits (see Troubleshooting).

Data Validation

Each extracted record is checked against the Book Pydantic model, entries missing any field in REQUIRED_KEYS are dropped, and duplicates are removed before the results are written out. A minimal sketch of these checks follows.
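
A sketch of what utils/data_utils.py might do (the function names and the dedup key are assumptions):

# utils/data_utils.py (sketch) - drop incomplete records, then dedupe
from config import REQUIRED_KEYS

def is_complete_book(book: dict) -> bool:
    # every REQUIRED_KEYS field must be present and non-empty
    return all(book.get(key) for key in REQUIRED_KEYS)

def dedupe_books(books: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for book in books:
        if book["detail_url"] not in seen:  # detail_url assumed unique per book
            seen.add(book["detail_url"])
            unique.append(book)
    return unique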

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  - books.toscrape.com for providing a demo site built for scraping practice
  - crawl4ai for the AI-assisted crawling framework
  - Google Gemini for powering the structured extraction

⚠️ Disclaimer

This project is for educational and research purposes. The scraped data comes from a demo website designed for web scraping practice. Always respect website terms of service and robots.txt when scraping real websites.

📈 Performance

A full run covers all 50 catalogue pages (roughly 1,000 books); the sample dataset above contains 993 validated records. Runtime is dominated by the per-page LLM extraction calls and the polite delay between requests.

🔧 Troubleshooting

Common Issues

  1. API Key Error: Ensure GEMINI_API_KEY is set in .env
  2. Browser Issues: Check if Chromium is installed
  3. Rate Limiting: Increase delays in main.py if needed

Debug Mode

Enable verbose logging by modifying get_browser_config() in utils/scraper_utils.py:

def get_browser_config() -> BrowserConfig:
    return BrowserConfig(browser_type="chromium", headless=False, verbose=True)

Note: This project demonstrates advanced web scraping techniques with AI-powered data extraction. Use responsibly and in accordance with website terms of service.