From guixu
Discovers, evaluates, and acquires datasets for AI model training/fine-tuning from Kaggle, HuggingFace, IPFS, arXiv, DBLP. Assesses quality, licensing, provenance; downloads free/paid data.
Slash command
/guixu:guixu
Guixu provides dataset discovery, valuation, and acquisition for AI agents. It searches across Kaggle, HuggingFace, IPFS, arXiv, DBLP, and other sources.
✅ USE this skill when:
❌ DON'T use this skill when:
Always start with intent_parse to structure the request:
intent_parse(query="find me a cat image dataset for classification", task_type="classification")
This returns a structured QueryProfile with:
- task_type: classification, detection, segmentation, etc.
- keywords: dataset content terms
- target_entity: main subject
- data_standard: sample_unit, budget, schema expectations
Use dataset_search with keywords from intent_parse:
dataset_search(query="cat image", task_type="classification", limit=10)
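To make the hand-off from intent_parse to dataset_search concrete, here is a minimal Python sketch. The `QueryProfile` dataclass and the `to_search_args` helper are hypothetical stand-ins that only mirror the field names listed above; the actual structure Guixu returns may differ.

```python
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    """Hypothetical mirror of the fields listed above, not Guixu's actual type."""
    task_type: str
    keywords: list[str]
    target_entity: str = ""
    data_standard: dict = field(default_factory=dict)


def to_search_args(profile: QueryProfile, limit: int = 10) -> dict:
    """Build keyword arguments shaped like the dataset_search call above."""
    return {
        "query": " ".join(profile.keywords),  # keywords become the search query
        "task_type": profile.task_type,
        "limit": limit,
    }


profile = QueryProfile(
    task_type="classification",
    keywords=["cat", "image"],
    target_entity="cat",
)
args = to_search_args(profile)
```

Under these assumptions, `args` carries the same query, task_type, and limit values as the dataset_search call shown above.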
Supported sources (leave empty to search all):
- kaggle, huggingface, ipfs, bittorrent
- arxiv, dblp, semanticscholar
- defillama, rwa_xyz, guixu-hub
- pansearch (cloud drives)
Filter options:
- filters.max_price: maximum price in USD
- filters.free_only: only free datasets
- filters.license: specific license (e.g., "CC-BY-4.0")
For each promising candidate, call dataset_evaluate:
dataset_evaluate(cid="kaggle:owner/dataset-name", task_description="cat image classification", task_type="classification", required_columns=["image_path", "label"])
This returns:
- tcv_score: -100 (harmful) to +100 (highly valuable)
- schema_fit: compatibility with required columns
- community_signal: reviews and ratings
Once a dataset is selected:
dataset_download(cid="kaggle:owner/dataset-name")
# or
dataset_download(cid="hf:owner/dataset-name")
Free sources: uci:, openml:, zenodo:, figshare:, hf: (public), ipfs:, guixu-hub:
Requires login: kaggle:
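The prefix in front of the colon decides whether a download needs credentials. The sketch below shows one way to check this before calling dataset_download. The helper names `split_cid` and `access_kind` are hypothetical, and the prefix sets simply copy the two lists above; a `hf:` dataset is treated as free on the assumption that it is public.

```python
# Prefix sets copied from the lists above; "hf" is assumed to mean a public dataset.
FREE_PREFIXES = {"uci", "openml", "zenodo", "figshare", "hf", "ipfs", "guixu-hub"}
LOGIN_PREFIXES = {"kaggle"}


def split_cid(cid: str) -> tuple[str, str]:
    """Split 'source:path' into (source, path); raise ValueError if malformed."""
    source, sep, path = cid.partition(":")
    if not sep or not source or not path:
        raise ValueError(f"expected 'source:path', got {cid!r}")
    return source, path


def access_kind(cid: str) -> str:
    """Return 'login', 'free', or 'unknown' based on the CID prefix."""
    source, _ = split_cid(cid)
    if source in LOGIN_PREFIXES:
        return "login"
    if source in FREE_PREFIXES:
        return "free"
    return "unknown"
```

A caller could use `access_kind` to prompt for Kaggle credentials up front instead of discovering the requirement when the download fails.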
1. intent_parse(query="find me a dog vs cat classification dataset", task_type="classification")
2. dataset_search(query="cat dog classification", task_type="classification", limit=10)
3. dataset_evaluate(cid="kaggle:username/dataset", task_description="dog cat binary classification", task_type="classification", required_columns=["image_path", "label"])
4. dataset_download(cid="kaggle:username/dataset")
1. intent_parse(query="find helmet detection dataset with bounding boxes", task_type="detection")
2. dataset_search(query="helmet detection bounding box", task_type="detection", limit=10)
3. For each candidate: dataset_evaluate(cid, task_description="helmet detection", task_type="detection", required_columns=["image_path", "bbox"])
4. Download best candidate
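The "evaluate each candidate, then download the best" steps above can be sketched as a small selection loop. This is an illustration only: `pick_best` is a hypothetical helper, the `evaluate` callable stands in for dataset_evaluate, and the result is assumed to be a dict carrying the `tcv_score` field described earlier. The scores in the usage lines are made-up placeholders, not real evaluations.

```python
from typing import Callable, Iterable, Optional


def pick_best(
    cids: Iterable[str],
    evaluate: Callable[[str], dict],
    min_score: float = 0.0,
) -> Optional[str]:
    """Return the CID with the highest tcv_score strictly above min_score."""
    best_cid, best_score = None, min_score
    for cid in cids:
        score = evaluate(cid).get("tcv_score")
        if score is None:
            continue  # skip candidates whose evaluation carries no score
        if score > best_score:
            best_cid, best_score = cid, score
    return best_cid


# Placeholder scores standing in for real dataset_evaluate results.
fake_scores = {"kaggle:a/helmets": 42, "hf:b/helmets": 71, "ipfs:c-helmets": -15}
best = pick_best(fake_scores, lambda cid: {"tcv_score": fake_scores[cid]})
```

With the default threshold of 0, candidates scored as neutral or harmful are never selected, so `pick_best` returns `None` when nothing scores above zero.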
1. intent_parse(query="find stock price dataset for time series forecasting", task_type="forecasting")
2. dataset_search(query="stock price time series", task_type="forecasting", filters={"source": "defillama"})
3. dataset_evaluate(cid, task_description="stock price prediction", task_type="forecasting", required_columns=["timestamp", "price"])
4. dataset_download(cid)
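Search steps like the one above pass a filters object. A minimal sketch of assembling it from the documented options (max_price, free_only, license) follows. The `build_filters` helper and its rule that free_only takes precedence over max_price are assumptions made for illustration, not documented Guixu behavior.

```python
from typing import Optional


def build_filters(
    max_price: Optional[float] = None,
    free_only: bool = False,
    license_id: Optional[str] = None,
) -> dict:
    """Assemble a filters dict using the option names documented for dataset_search."""
    if max_price is not None and max_price < 0:
        raise ValueError("max_price must be non-negative")
    filters: dict = {}
    if free_only:
        # Assumption: a price cap is redundant once only free datasets are allowed.
        filters["free_only"] = True
    elif max_price is not None:
        filters["max_price"] = max_price
    if license_id:
        filters["license"] = license_id
    return filters


capped = build_filters(max_price=20, license_id="CC-BY-4.0")
free = build_filters(free_only=True, max_price=5)
```

Keeping the construction in one place makes it harder to send an accidental negative price cap or a conflicting free_only and max_price pair.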
- intent_parse FIRST: it extracts task_type and keywords that improve search quality
- Set require_license_review: true in evaluation
- Use filters.max_price or budget to limit spending
- Try guixu-hub, huggingface, ipfs before paid sources
If dataset_search returns no results:
If dataset_evaluate fails:
If dataset_download fails:
source:owner/dataset
npx claudepluginhub guixu-project/guixu