From superpowers
Automatically builds image datasets from the web using textual metadata for query expansion and CNN-based filtering. Reduces bias and improves cross-dataset generalization.
Slash command
/superpowers:textual-metadata-dataset-construction
This skill provides a framework for automatically collecting diverse, high-quality image datasets from the web using semantic query expansion and progressive CNN-based filtering. The methodology addresses key challenges:
Key Innovation: Uses Google Books Ngrams Corpora for query expansion to capture richer semantic descriptions, then progressively filters using CNNs.
Use this skill when:
Initial Query Definition:
initial_queries = ["dog", "car", "airplane"]
Semantic Expansion with N-gram Corpora:
def expand_query_with_ngrams(query, ngram_data):
    """
    Expand a query using Google Books Ngrams:
    - Find co-occurring terms
    - Add synonyms and related concepts
    - Include descriptive modifiers
    Example: "dog" → ["dog breed", "puppy", "canine",
                      "dog playing", "dog running", ...]
    """
    # get_ngrams and rank_by_relevance are helper functions you supply
    # Get bigrams containing the query
    bigrams = get_ngrams(query, n=2, ngram_data=ngram_data)
    # Get trigrams for context
    trigrams = get_ngrams(query, n=3, ngram_data=ngram_data)
    # Combine and rank by frequency
    expansions = rank_by_relevance(bigrams + trigrams)
    return expansions
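The helpers get_ngrams and rank_by_relevance are not defined in this skill. The following is a minimal sketch of what they might look like, assuming ngram_data is an in-memory dict that maps space-separated n-gram strings to counts. That format is a hypothetical simplification: the real Google Books Ngrams files are large and would need to be parsed and indexed first. The top_k parameter and the toy counts are also illustrative.

```python
def get_ngrams(query, n, ngram_data):
    """Return (ngram, count) pairs of length n that contain the query as a token."""
    matches = []
    for ngram, count in ngram_data.items():
        tokens = ngram.split()
        if len(tokens) == n and query in tokens:
            matches.append((ngram, count))
    return matches


def rank_by_relevance(candidates, top_k=10):
    """Rank candidates by raw count (a stand-in for a real relevance score)."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [ngram for ngram, _ in ranked[:top_k]]


# Toy counts, made up for illustration only
toy_ngrams = {
    "dog breed": 120,
    "hot dog": 300,
    "dog running": 45,
    "small dog breed": 30,
    "cat food": 500,
}
bigrams = get_ngrams("dog", n=2, ngram_data=toy_ngrams)
trigrams = get_ngrams("dog", n=3, ngram_data=toy_ngrams)
expansions = rank_by_relevance(bigrams + trigrams)
print(expansions)
```

Ranking by frequency alone can surface off-topic expansions such as "hot dog", which is one reason the later filtering stages matter.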
Visual Saliency Filtering:
def filter_expansions(expansions, visual_model, threshold=0.5):
    """
    Remove expansions that are:
    - Visually non-salient (abstract concepts)
    - Less relevant to the visual domain
    - Too generic or too specific
    Uses a pre-trained visual model to score saliency.
    """
    # threshold=0.5 is a placeholder default; tune it for your saliency model
    filtered = []
    for exp in expansions:
        # Keep the expansion only if it names a visually identifiable concept
        saliency_score = visual_model.predict_saliency(exp)
        if saliency_score > threshold:
            filtered.append(exp)
    return filtered
Multi-Query Image Collection:
def collect_images(expanded_queries, search_engine, images_per_query=500):
    """
    Retrieve images using expanded queries:
    - Use multiple search engines
    - Collect metadata (source URL, query used)
    - Diversify sources to reduce bias
    search_engine is any client object exposing image_search(query, num_results).
    """
    all_images = []
    for query in expanded_queries:
        images = search_engine.image_search(
            query,
            num_results=images_per_query
        )
        for img in images:
            # Record which expanded query produced each image
            img['source_query'] = query
        all_images.extend(images)
    return all_images
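The docstring mentions multiple search engines and diversified sources, while the loop queries a single engine. One way to merge results from several engines is sketched below. It assumes each engine is an object with an image_search(query, num_results) method returning dicts that carry a 'url' key. FakeEngine, collect_from_engines, and the 'source_engine' field are hypothetical names used only for illustration.

```python
class FakeEngine:
    """Stand-in for a real search client; returns canned results."""

    def __init__(self, name, results):
        self.name = name
        self._results = results

    def image_search(self, query, num_results):
        return [dict(r) for r in self._results.get(query, [])[:num_results]]


def collect_from_engines(expanded_queries, engines, images_per_query=500):
    """Query every engine for every expansion, keeping the first copy of each URL."""
    seen_urls = set()
    collected = []
    for query in expanded_queries:
        for engine in engines:
            for img in engine.image_search(query, num_results=images_per_query):
                if img["url"] in seen_urls:
                    continue  # already found via another query or engine
                seen_urls.add(img["url"])
                img["source_query"] = query
                img["source_engine"] = engine.name
                collected.append(img)
    return collected


engine_a = FakeEngine("a", {"dog running": [{"url": "http://example.com/1.jpg"},
                                            {"url": "http://example.com/2.jpg"}]})
engine_b = FakeEngine("b", {"dog running": [{"url": "http://example.com/2.jpg"},
                                            {"url": "http://example.com/3.jpg"}]})
images = collect_from_engines(["dog running"], [engine_a, engine_b])
print(len(images))
```

Deduplicating by URL only catches exact repeats; visually identical images hosted at different URLs still need the perceptual-hash step in preprocessing.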
Initial Preprocessing:
def preprocess_images(images):
    """
    - Remove duplicates (perceptual hash)
    - Validate image format
    - Resize to standard dimensions
    - Remove corrupted files
    """
    pass
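preprocess_images is left as a stub. Its duplicate-removal step could be sketched with an average hash, which is one simple form of perceptual hash. The sketch assumes each image has already been decoded, converted to grayscale, and downscaled to a small fixed grid (for example 8x8) by an imaging library such as Pillow. The function names and the max_distance value are illustrative assumptions, not part of the skill.

```python
def average_hash(gray_pixels):
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [p for row in gray_pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def remove_near_duplicates(thumbnails, max_distance=2):
    """Return indices of thumbnails to keep, dropping near-duplicates of earlier ones."""
    kept_indices = []
    kept_hashes = []
    for idx, thumb in enumerate(thumbnails):
        h = average_hash(thumb)
        if all(hamming_distance(h, k) > max_distance for k in kept_hashes):
            kept_indices.append(idx)
            kept_hashes.append(h)
    return kept_indices


# Tiny 2x2 grids for illustration; real thumbnails would be e.g. 8x8
original = [[10, 200], [220, 30]]
brighter_copy = [[20, 210], [230, 40]]   # same structure, shifted brightness
different = [[200, 10], [30, 220]]
print(remove_near_duplicates([original, brighter_copy, different]))
```

Comparing every hash against all kept hashes is quadratic in the number of images, so a large crawl would likely need bucketing or an index instead.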
Feature Extraction:
import numpy as np
def extract_features(images, cnn_model):
    """
    Extract deep features using a pre-trained CNN
    (e.g., VGG or ResNet features from the penultimate layer)
    """
    features = []
    for img in images:
        feat = cnn_model.extract_features(img)
        features.append(feat)
    return np.array(features)
Cluster Analysis:
from sklearn.cluster import KMeans
def cluster_and_filter(features, images, n_clusters=10,
                       threshold=0.05, min_coherence=0.5):
    """
    Cluster images by visual similarity:
    - Identify core clusters (likely relevant)
    - Remove outlier clusters (likely noise)
    - Keep images from dense, coherent clusters
    """
    # threshold and min_coherence defaults are placeholders; tune them per dataset
    kmeans = KMeans(n_clusters=n_clusters)
    clusters = kmeans.fit_predict(features)
    # Analyze cluster statistics; analyze_clusters is a helper you supply,
    # assumed to return one dict per cluster with 'id', 'density',
    # and 'coherence' keys
    cluster_stats = analyze_clusters(clusters, features)
    # Keep the ids of dense, coherent clusters (drop low density, high variance)
    valid_clusters = {
        c['id'] for c in cluster_stats
        if c['density'] > threshold and c['coherence'] > min_coherence
    }
    filtered_images = [
        img for img, c in zip(images, clusters)
        if c in valid_clusters
    ]
    return filtered_images
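The analyze_clusters helper is not defined in this skill, and neither "density" nor "coherence" is given a formula. A minimal sketch under assumed definitions: density is the share of all points that fall in a cluster, and coherence is 1 / (1 + mean distance to the cluster centroid). Both definitions, the 'id' key, and the cutoffs in the usage lines are illustrative choices rather than the original method.

```python
import math


def analyze_clusters(clusters, features):
    """Compute simple per-cluster statistics.

    density   = share of all points that fall in the cluster (assumed definition)
    coherence = 1 / (1 + mean distance to the cluster centroid) (assumed definition)
    """
    members = {}
    for label, feat in zip(clusters, features):
        members.setdefault(label, []).append(feat)

    total = len(features)
    stats = []
    for label, feats in sorted(members.items()):
        dim = len(feats[0])
        centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        mean_dist = sum(math.dist(f, centroid) for f in feats) / len(feats)
        stats.append({
            "id": label,
            "density": len(feats) / total,
            "coherence": 1.0 / (1.0 + mean_dist),
        })
    return stats


# Two tight clusters and one scattered, sparse one (toy 2-D features)
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
            [20.0, -3.0], [-9.0, 14.0]]
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
stats = analyze_clusters(clusters, features)
valid_ids = {s["id"] for s in stats if s["density"] > 0.3 and s["coherence"] > 0.5}
print(valid_ids)
```

In this toy data the scattered third cluster fails both cutoffs, so only the two tight clusters survive.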
Initial CNN Training:
def train_initial_classifier(clustered_images, num_classes):
    """
    Train initial CNN classifier on clustered data:
    - Use cluster assignments as pseudo-labels
    - Fine-tune pre-trained model
    """
    model = load_pretrained_cnn()
    model = fine_tune(model, clustered_images)
    return model
Progressive Refinement:
def progressive_filtering(images, model, iterations=3):
    """
    Iteratively refine dataset:
    1. Classify all images with current model
    2. Remove low-confidence predictions
    3. Retrain model on refined set
    4. Repeat
    """
    for i in range(iterations):
        # Predict on all images
        predictions = model.predict(images)
        # Filter by confidence
        confident_samples = [
            (img, pred) for img, pred in zip(images, predictions)
            if pred['confidence'] > confidence_threshold(i)
        ]
        # Retrain on refined set
        model = train_classifier(confident_samples)
        images = [s[0] for s in confident_samples]
    return images, model
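The loop calls confidence_threshold(i) without defining it. One plausible reading is a cutoff that tightens with each pass. Below is a sketch of a linear schedule; the start and end values and the filter_confident helper are assumptions chosen for illustration, not values from the original method.

```python
def confidence_threshold(iteration, start=0.5, end=0.9, total_iterations=3):
    """Linearly raise the cutoff from `start` (first pass) to `end` (last pass)."""
    if total_iterations <= 1:
        return end
    step = (end - start) / (total_iterations - 1)
    return start + step * iteration


def filter_confident(predictions, iteration):
    """Keep indices of predictions whose confidence exceeds the current cutoff."""
    cutoff = confidence_threshold(iteration)
    return [i for i, p in enumerate(predictions) if p["confidence"] > cutoff]


toy_predictions = [{"confidence": c} for c in (0.95, 0.8, 0.6, 0.4)]
for i in range(3):
    print(i, round(confidence_threshold(i), 2), filter_confident(toy_predictions, i))
```

Starting with a loose cutoff keeps enough samples for the first retraining round, and the rising cutoff then removes borderline images once the model is more reliable.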
Quality Verification:
def verify_dataset_quality(dataset, stl10_test, cifar10_test):
    """
    Evaluate dataset quality:
    - Cross-dataset generalization (test on STL-10, CIFAR-10)
    - Class balance analysis
    - Diversity metrics
    """
    # Train classifier on generated dataset
    model = train_classifier(dataset)
    # Test on external datasets
    stl10_accuracy = evaluate(model, stl10_test)
    cifar10_accuracy = evaluate(model, cifar10_test)
    return {
        'cross_dataset_acc': (stl10_accuracy + cifar10_accuracy) / 2,
        'class_balance': compute_balance(dataset),
        'diversity': compute_diversity(dataset)
    }
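compute_balance and compute_diversity are not defined in this skill. A sketch under assumed definitions: balance as the normalized entropy of the class counts, and diversity as the mean pairwise Euclidean distance between feature vectors. The functions below take labels and features directly, which is an assumption about how the dataset is represented.

```python
import math
from collections import Counter
from itertools import combinations


def compute_balance(labels):
    """Normalized entropy of the class distribution: 1.0 = perfectly balanced."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def compute_diversity(features):
    """Mean pairwise Euclidean distance between feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


balanced = ["dog"] * 50 + ["car"] * 50
skewed = ["dog"] * 95 + ["car"] * 5
print(round(compute_balance(balanced), 3), round(compute_balance(skewed), 3))
print(compute_diversity([[0.0, 0.0], [3.0, 4.0]]))
```

The pairwise distance is quadratic in the number of images, so on a large dataset it would usually be computed on a random sample.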
Export Dataset:
dataset/
├── train/
│   ├── class_1/
│   ├── class_2/
│   └── ...
├── val/
├── test/
├── metadata.json
└── dataset_stats.md
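The layout above implies a train/val/test split and a metadata file, but the skill does not specify the split ratios or what metadata.json contains. A possible sketch, assuming each item is a dict with 'path' and 'label' keys, an 80/10/10 split per class, and a metadata summary of split sizes. All of these are illustrative choices.

```python
import json
import random
from collections import defaultdict


def split_by_class(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle each class independently and cut it into train/val/test."""
    by_class = defaultdict(list)
    for item in items:
        by_class[item["label"]].append(item)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits


def build_metadata(splits):
    """Summarize split sizes and per-class counts as a JSON-serializable dict."""
    meta = {}
    for name, items in splits.items():
        per_class = defaultdict(int)
        for item in items:
            per_class[item["label"]] += 1
        meta[name] = {"size": len(items), "per_class": dict(per_class)}
    return meta


items = [{"path": f"{label}_{i}.jpg", "label": label}
         for label in ("class_1", "class_2") for i in range(10)]
splits = split_by_class(items)
print(json.dumps(build_metadata(splits), indent=2))
```

Splitting within each class keeps the class proportions roughly equal across train, val, and test, and the fixed seed makes the split reproducible.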
Query Expansion:
Clustering:
Progressive Filtering:
Validation:
Based on original research:
# Deep learning
pip install torch torchvision # or tensorflow
# Clustering
pip install scikit-learn
# Image processing
pip install pillow opencv-python
# N-gram data
# Download Google Books Ngrams: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
npx claudepluginhub lunartech-x/superpowers --plugin superpowers