automated-image-dataset-generation | superpowers

Automated Image Dataset Generation with LMMs

Overview

This skill provides a scalable, reusable framework for automatically generating labeled image datasets. It combines web scraping for image collection with Large Multimodal Models (LMMs) for metadata generation. The approach addresses the main drawbacks of manual data collection, which is resource-intensive, error-prone, and slow.

Key Capabilities:

- Automated web image collection at scale (50,000+ images)
- LMM-powered metadata generation with ~95% accuracy (94.8% reported in the source study)
- Rule-based filtering for domain-specific categorization
- Structured output for object detection and classification tasks

When to Use This Skill

Use this skill when:

- You are building custom image datasets for machine learning applications
- You are collecting domain-specific images that aren't available in existing datasets
- You need automated image labeling or metadata generation
- You are working on object detection or image classification projects
- Manual annotation is too expensive or time-consuming
- You require large-scale training data for computer vision models

Core Workflow

Phase 1: Query Design and Planning

Define Target Categories:
- Identify specific objects/classes to collect
- Create hierarchical category structure if needed
- Example categories: beams, columns, trusses, steel frames

Design Search Queries:

# Generate diverse search queries
categories = ["structural steel beam", "steel column construction", "roof truss"]
query_variations = [
    f"{cat} {mod}" 
    for cat in categories 
    for mod in ["photo", "site", "construction", "building"]
]

Set Collection Parameters:
- Target image count per category
- Image quality thresholds (resolution, format)
- Source diversity requirements

Phase 2: Web Scraping

Implement Multi-Source Scraping:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_images(query, num_images=1000):
    """
    Scrape images from multiple sources:
    - Google Images
    - Bing Images
    - Domain-specific sites
    """
    images = []
    
    # Use appropriate rate limiting
    # Respect robots.txt
    # Store source URLs for attribution
    
    return images
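
The comments in `scrape_images` mention rate limiting and robots.txt without showing either. Below is a minimal sketch of both using only the standard library. The names `build_robot_rules`, `RateLimiter`, `allowed_urls`, and the `dataset-bot` user agent are hypothetical and not part of the original skill; fetching the robots.txt file itself is left out.

```python
import time
from urllib import robotparser


def build_robot_rules(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class RateLimiter:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._last_call = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = time.monotonic()
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval_s - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


def allowed_urls(urls, rules, user_agent="dataset-bot"):
    """Keep only URLs that the parsed robots.txt rules permit."""
    return [u for u in urls if rules.can_fetch(user_agent, u)]
```

A scraper would call `limiter.wait()` before each request and pass candidate URLs through `allowed_urls` before downloading anything.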

Image Download and Storage:

def download_images(image_urls, output_dir):
    """
    Download images with:
    - Duplicate detection (hash-based)
    - Format validation
    - Resolution filtering
    - Metadata preservation
    """
    pass

Initial Filtering:
- Remove corrupted/invalid images
- Filter by minimum resolution (e.g., 224x224)
- Deduplicate using perceptual hashing

Phase 3: LMM-Based Metadata Generation

Configure LMM (Gemini Vision or equivalent):

import os

import google.generativeai as genai
import PIL.Image

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-flash')

def generate_metadata(image_path, categories):
    """
    Use LMM to analyze image and generate metadata
    """
    image = PIL.Image.open(image_path)
    
    prompt = f"""
    Analyze this image and determine:
    1. Does it contain any of these objects: {categories}?
    2. If yes, which specific category?
    3. Confidence score between 0.0 and 1.0
    4. Object location description (for detection tasks)
    5. Image quality assessment
    
    Return structured JSON response.
    """
    
    response = model.generate_content([prompt, image])
    return parse_response(response.text)
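
`parse_response` is called above but never defined. One possible sketch follows, assuming the model replies with a JSON object that may be wrapped in a Markdown code fence or surrounded by extra text. The key names `category` and `confidence` and the label-to-score mapping are assumptions made for illustration; the prompt in the source does not fix a schema.

```python
import json
import re

# Assumed mapping for replies that give a label instead of a number.
LABEL_TO_SCORE = {"high": 0.9, "medium": 0.6, "low": 0.3}


def parse_response(text):
    """Extract a JSON object from LMM output and normalise its confidence.

    Returns a dict with 'category' and a numeric 'confidence', or None
    when the reply cannot be parsed or validated.
    """
    # Grab everything from the first '{' to the last '}'.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "category" not in data:
        return None

    confidence = data.get("confidence")
    if isinstance(confidence, str):
        confidence = LABEL_TO_SCORE.get(confidence.strip().lower())
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    data["confidence"] = float(confidence)
    return data
```

Because this version returns `None` on failure, callers should drop `None` results before any confidence-based filtering.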

Batch Processing:

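import time

def process_dataset(image_dir, categories, batch_size=100, delay_s=1.0):
    """
    Process images in batches with rate limiting, error handling,
    progress tracking, and checkpoint saving.

    get_batches and save_checkpoint are helpers you must supply;
    delay_s is an assumed pause between API calls.
    """
    results = []
    for batch_num, batch in enumerate(get_batches(image_dir, batch_size), start=1):
        for img in batch:
            try:
                results.append(generate_metadata(img, categories))
            except Exception as exc:
                # One failed image should not abort the whole run
                print(f"Skipping {img}: {exc}")
            time.sleep(delay_s)  # crude rate limit
        save_checkpoint(results)  # persist progress after every batch
        print(f"Batch {batch_num} done, {len(results)} results so far")
    return results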
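
`process_dataset` relies on `get_batches` and `save_checkpoint`, which the source leaves undefined. A minimal sketch of both, plus a `load_checkpoint` counterpart for resuming a run, might look like this. The accepted file suffixes, the default checkpoint file name, and `load_checkpoint` itself are assumptions.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted formats


def get_batches(image_dir, batch_size):
    """Yield lists of image paths, at most `batch_size` per list."""
    paths = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def save_checkpoint(results, checkpoint_path="checkpoint.json"):
    """Write to a temp file, then rename, so a crash cannot leave a
    half-written checkpoint behind."""
    target = Path(checkpoint_path)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(results))
    tmp.replace(target)


def load_checkpoint(checkpoint_path="checkpoint.json"):
    """Return previously saved results, or an empty list on a fresh run."""
    target = Path(checkpoint_path)
    return json.loads(target.read_text()) if target.exists() else []
```

The results passed to `save_checkpoint` must be JSON-serializable, which holds when each entry is a plain dict of strings and numbers.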

Quality Metrics:
- Track LMM confidence scores
- Flag low-confidence predictions for review
- Calculate category distribution
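
These metrics can be computed directly from the metadata list. The sketch below assumes each entry is a dict with a `category` string and a numeric `confidence`; the function name and the 0.8 review threshold are illustrative choices, not values taken from the source.

```python
from collections import Counter


def summarize_quality(results, review_threshold=0.8):
    """Return the category distribution and the entries that need review."""
    distribution = Counter(item["category"] for item in results)
    needs_review = [
        item for item in results
        if item["confidence"] < review_threshold
    ]
    return distribution, needs_review
```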

Phase 4: Rule-Based Filtering

Apply Category Rules:

def filter_by_rules(metadata, rules):
    """
    Apply domain-specific rules:
    - Minimum confidence threshold (e.g., 0.8)
    - Category-specific validation
    - Cross-reference with search query
    """
    filtered = []
    for item in metadata:
        if item['confidence'] >= rules['min_confidence']:
            if validate_category(item, rules):
                filtered.append(item)
    return filtered
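
`validate_category` is not defined in the source. One way it could work is sketched below: accept an item only if its category is on an allow-list and, when the originating search query was stored, the query mentions a keyword for that category. The `allowed_categories` and `query_keywords` rule keys and the optional `query` field are assumptions.

```python
def validate_category(item, rules):
    """Check the category against an allow-list and the search query.

    Assumed shapes (not from the source):
      item  = {"category": str, "confidence": float, "query": str (optional)}
      rules = {"allowed_categories": set,
               "query_keywords": {category: [keywords]}}
    """
    category = item.get("category")
    if category not in rules["allowed_categories"]:
        return False

    query = item.get("query")
    if query is None:
        return True  # nothing to cross-reference against

    keywords = rules.get("query_keywords", {}).get(category, [])
    return any(keyword in query.lower() for keyword in keywords)
```

With this helper in place, a `rules` dict for `filter_by_rules` would carry `min_confidence` alongside the two keys above.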

Handle Edge Cases:
- Multi-label images (multiple categories)
- Ambiguous classifications
- Partial object visibility

Phase 5: Dataset Finalization

Generate Dataset Structure:

dataset/
├── images/
│   ├── category_1/
│   ├── category_2/
│   └── ...
├── annotations/
│   ├── metadata.json
│   └── labels.csv
├── splits/
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
└── README.md

Create Annotation Files:

def create_annotations(filtered_data, output_dir):
    """
    Generate standard annotation formats:
    - COCO format (for object detection)
    - CSV with labels (for classification)
    - YOLO format (if needed)
    """
    pass
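
As a rough illustration of what `create_annotations` could do, the sketch below writes `labels.csv` for classification and a COCO-style `metadata.json` skeleton. The item fields (`file_name`, `category`, `confidence`) are assumptions. The `annotations` list is left empty because the steps above produce a textual location description rather than bounding boxes, and a complete COCO file would also need image width and height.

```python
import csv
import json
from pathlib import Path


def create_annotations(filtered_data, output_dir):
    """Write labels.csv and a COCO-style metadata.json skeleton.

    Each item is assumed to look like
    {"file_name": "img_001.jpg", "category": "beam", "confidence": 0.93}.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    with open(out / "labels.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file_name", "category", "confidence"])
        for item in filtered_data:
            writer.writerow(
                [item["file_name"], item["category"], item["confidence"]]
            )

    category_names = sorted({item["category"] for item in filtered_data})
    coco = {
        "images": [
            {"id": index, "file_name": item["file_name"]}
            for index, item in enumerate(filtered_data, start=1)
        ],
        "categories": [
            {"id": index, "name": name}
            for index, name in enumerate(category_names, start=1)
        ],
        "annotations": [],  # no bounding boxes available at this stage
    }
    (out / "metadata.json").write_text(json.dumps(coco, indent=2))
    return coco
```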

Split Dataset:
- Train/Val/Test split (typically 70/15/15)
- Stratified splitting by category
- Ensure no data leakage
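
A stratified 70/15/15 split can be sketched with the standard library as follows. The function name and the `(file_name, category)` input shape are assumptions; scikit-learn's `train_test_split` with its `stratify` argument is a common alternative. Each file lands in exactly one split, which rules out the simplest form of leakage, but near-duplicate images still depend on the earlier deduplication step.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split (file_name, category) pairs into train/val/test while
    preserving the per-category proportions."""
    by_category = defaultdict(list)
    for file_name, category in items:
        by_category[category].append(file_name)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for category in sorted(by_category):
        files = sorted(by_category[category])
        rng.shuffle(files)
        n_train = round(len(files) * ratios[0])
        n_val = round(len(files) * ratios[1])
        splits["train"] += files[:n_train]
        splits["val"] += files[n_train:n_train + n_val]
        splits["test"] += files[n_train + n_val:]  # remainder goes to test
    return splits
```

The resulting lists can be written line by line to `train.txt`, `val.txt`, and `test.txt` in the `splits/` directory.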

Best Practices

Web Scraping

- Respect rate limits: 1-2 requests per second
- Rotate user agents: Avoid detection
- Use proxies: For large-scale collection
- Cache responses: Avoid redundant downloads
- Store source URLs: For attribution and verification

LMM Usage

- Use appropriate prompts: Be specific about expected output format
- Batch processing: Optimize API costs
- Handle API errors: Implement retry logic with exponential backoff
- Validate responses: Parse and validate JSON responses
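
Retry with exponential backoff can be sketched as a small wrapper around any callable. The function name, default values, and the 10% jitter are assumptions; in practice you would catch only the transient error types raised by your LMM client rather than every `Exception`.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay_s * 2**attempt (plus a
    little jitter) and try again.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay_s * (2 ** attempt)
            # Jitter spreads out retries from parallel workers
            sleep(delay + random.uniform(0, delay * 0.1))
```

For example, `call_with_backoff(lambda: generate_metadata(path, categories))` would retry a failed metadata request after roughly 1, 2, 4, and 8 seconds.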

Data Quality

- Verify a sample manually: Check 100-200 random images
- Calculate inter-annotator agreement: If using multiple LMMs
- Document accuracy metrics: Report precision/recall per category
- Version your dataset: Track changes over time
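
Both metrics named above are easy to compute once a manually verified sample exists. The sketch below shows per-category precision and recall, plus Cohen's kappa as one common agreement measure between two labelers (for instance two LMMs). The function names are hypothetical, and the source does not say which agreement statistic was used.

```python
from collections import Counter


def per_category_precision_recall(predicted, actual):
    """Compare LMM labels with manually verified labels for the same images."""
    true_pos, pred_count, actual_count = Counter(), Counter(), Counter()
    for pred, truth in zip(predicted, actual):
        pred_count[pred] += 1
        actual_count[truth] += 1
        if pred == truth:
            true_pos[pred] += 1

    report = {}
    for category in set(pred_count) | set(actual_count):
        n_pred, n_actual = pred_count[category], actual_count[category]
        report[category] = {
            "precision": true_pos[category] / n_pred if n_pred else 0.0,
            "recall": true_pos[category] / n_actual if n_actual else 0.0,
        }
    return report


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)
```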

Legal & Ethical

- Check image licenses: Prefer CC-licensed content
- Respect robots.txt: Don't scrape disallowed pages
- Attribute sources: Maintain source URLs
- Consider privacy: Filter personal/sensitive content

Expected Results

Based on the original research (Gharib & Moselhi, 2025):

- Collection scale: 50,000+ raw images
- After filtering: ~5% relevant, domain-specific images (roughly 2,500 per 50,000 collected)
- Metadata accuracy: 94.8%
- Categories: Successfully identifies 5+ distinct categories

Integration with Other Skills

- scientific-schematics: Generate dataset visualization diagrams
- exploratory-data-analysis: Analyze dataset statistics
- pytorch: Train models on generated dataset
- matplotlib/seaborn: Visualize class distributions

Dependencies

# Core
pip install requests beautifulsoup4 selenium pillow

# LMM
pip install google-generativeai  # or openai for GPT-4V

# Image processing
pip install imagehash opencv-python

# Dataset tools
pip install pandas scikit-learn

References

Gharib, S., & Moselhi, O. (2025). Automated Image Dataset Generation Using Web Scraping and Large Multimodal Models for Construction Applications. ISARC 2025.