By phy-ren
Fetch research datasets (HuggingFace / GitHub / direct URL) into /home/datasets/<slug>/ with a MANIFEST. Idempotent, shared across users.
Fetch research datasets into /home/datasets/<slug>/ as ready-to-use files. A slug may combine multiple sources (GitHub + HuggingFace + direct URLs) into one directory. Archives auto-extract. An optional expects contract rejects silent partials.
sudo mkdir -p /home/datasets && sudo chmod 2777 /home/datasets
Register the plugin in ~/.claude/settings.json:
{
  "extraKnownMarketplaces": {
    "datasets-tools": {
      "source": { "source": "directory", "path": "/home/xingyu/datasets-tools" }
    }
  },
  "enabledPlugins": {
    "datasets-tools@datasets-tools": true
  }
}
HuggingFace support shells out to huggingface-cli. Install once per machine:
pip install -U 'huggingface_hub[cli]'
# register a multi-source slug with a size contract
uv run ~/datasets-tools/dataset_tool.py add nmrgym \
--gh AIMS-Lab-HKUSTGZ/NMRGym --hf meaw0415/NMRGym \
--expect-min-size-mb 50
# pull everything
uv run ~/datasets-tools/dataset_tool.py fetch nmrgym
# check state
uv run ~/datasets-tools/dataset_tool.py list
uv run ~/datasets-tools/dataset_tool.py manifest nmrgym
# regenerate MANIFEST from what's already on disk — no refetch
# (use after editing the registry or upgrading the tool)
uv run ~/datasets-tools/dataset_tool.py remanifest nmrgym
fetch <slug> runs:
- GitHub clone, huggingface-cli download, and HTTP streaming (each configured source, merged into the same target dir). HTTP writes to <file>.part and renames on success.
- A check of the expects contract: expects.min_size_mb and expects.contains.
- A MANIFEST write that records source (registry intent) and method (what actually ran this invocation, e.g. source: http, method: manual when an agent drops files past a WAF); elapsed is annotated with what it measures (fetch, promote only, remanifest); requires lists load-time Python deps (declared + auto-detected).

Any failure before the MANIFEST step writes DOWNLOAD_ME.md with recovery steps and exits 2 instead of MANIFEST-ing a partial directory.
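The ".part then rename" behaviour described for HTTP downloads can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `write_atomically` is hypothetical, and the choice to leave the `.part` file behind on failure is an assumption.

```python
import os
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], dest: Path) -> Path:
    """Stream chunks into '<dest>.part', then rename to dest only on success.

    Hypothetical helper: if the stream raises midway, the exception
    propagates, the .part file stays on disk, and dest never appears.
    """
    part = dest.with_name(dest.name + ".part")
    with open(part, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before renaming
    os.replace(part, dest)  # atomic when both paths are on the same filesystem
    return dest
```

Because the final name only ever appears through `os.replace`, a reader of the target directory never sees a half-written file under its real name.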
<slug>:
  gh_repo: OWNER/NAME          # optional, cloned first
  gh_ref: BRANCH/TAG/COMMIT    # optional
  hf_dataset: OWNER/NAME       # optional, downloaded via huggingface-cli
  hf_subdir: DIR               # optional, placed under this relative subdir
  hf_revision: REV             # optional
  url: URL                     # optional singular
  urls: [URL, ...]             # optional plural (use when >1)
  expects:                     # optional contract
    min_size_mb: 50
    contains: [data/train.csv]
  requires: [pandas, rdkit]    # optional, Python deps needed to load the files
  caveat: post-fetch note about the data (license / citation / loading quirks)
  source: derived "github+hf+http" label (read-only display)
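As a rough illustration of what the expects contract guards against, the sketch below checks a fetched directory against min_size_mb and contains. The function name `check_expects` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; the tool may count differently.

```python
from pathlib import Path
from typing import List, Optional


def check_expects(
    target: Path,
    min_size_mb: float = 0,
    contains: Optional[List[str]] = None,
) -> List[str]:
    """Return a list of contract violations; an empty list means the check passed."""
    problems = []
    # Total size of all regular files under the target directory.
    total_bytes = sum(p.stat().st_size for p in target.rglob("*") if p.is_file())
    if total_bytes < min_size_mb * 1024 * 1024:  # assumption: binary megabytes
        problems.append(
            f"size {total_bytes / (1024 * 1024):.2f} MB is below min_size_mb={min_size_mb}"
        )
    # Every path listed under contains must exist relative to the target.
    for rel in contains or []:
        if not (target / rel).exists():
            problems.append(f"missing expected path: {rel}")
    return problems
```

A non-empty result here would correspond to the "silent partial" case that the contract is meant to reject.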
requires is also auto-detected from serialized-object opcodes in .pkl files
and from file extensions (.parquet → pyarrow, .h5 → h5py, .npy → numpy,
.pt → torch, .safetensors → safetensors). Declared entries take precedence;
auto-detected entries are suffixed (auto) in MANIFEST.
caveat is for post-fetch realities. Fetch-time obstacles (bot challenges,
private shares, "manual download needed") belong in the pending flow — they
get surfaced in DOWNLOAD_ME.md while a slug is pending, then cleared when
MANIFEST is written. Any such sentences in caveat are auto-pruned at render
time so MANIFEST never contradicts ground truth.
At least one of gh_repo / hf_dataset / url(s) must be present.
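The "at least one source" rule and the derived source label can be illustrated together. This is a sketch under assumptions: `source_label` is a hypothetical name, and the component order simply follows the "github+hf+http" example in the schema.

```python
from typing import Any, Dict


def source_label(entry: Dict[str, Any]) -> str:
    """Derive a label such as 'github+hf+http' from a registry entry.

    Raises ValueError when the entry names no source at all.
    """
    parts = []
    if entry.get("gh_repo"):
        parts.append("github")
    if entry.get("hf_dataset"):
        parts.append("hf")
    if entry.get("url") or entry.get("urls"):
        parts.append("http")
    if not parts:
        raise ValueError("entry needs at least one of gh_repo, hf_dataset, url, urls")
    return "+".join(parts)
```

For the nmrgym entry registered earlier (a GitHub repo plus a HuggingFace dataset), this sketch would yield "github+hf".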
| code | meaning |
|---|---|
| 0 | complete |
| 1 | error (network, filesystem, bad archive, missing tool) |
| 2 | pending — DOWNLOAD_ME.md written, manual completion required |
| 64 | usage |
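A calling script might branch on these exit codes roughly as follows. The enum and helper names are hypothetical; only the numeric codes and the `uv run ~/datasets-tools/dataset_tool.py fetch <slug>` invocation come from this document.

```python
import os
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    COMPLETE = 0
    ERROR = 1
    PENDING = 2
    USAGE = 64


_MEANINGS = {
    ExitCode.COMPLETE: "complete",
    ExitCode.ERROR: "error (network, filesystem, bad archive, missing tool)",
    ExitCode.PENDING: "pending: DOWNLOAD_ME.md written, manual completion required",
    ExitCode.USAGE: "usage",
}


def describe(code: int) -> str:
    """Map a process exit status to the meaning listed in the table above."""
    try:
        return _MEANINGS[ExitCode(code)]
    except ValueError:
        return f"unknown exit code {code}"


def fetch(slug: str) -> int:
    """Run the documented fetch command and return its raw exit status."""
    tool = os.path.expanduser("~/datasets-tools/dataset_tool.py")
    return subprocess.run(["uv", "run", tool, "fetch", slug]).returncode
```

A wrapper could then treat 2 as "stop and ask a human" and 1 as "worth retrying", without parsing any output.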
list marks each slug with one of two states:
- ✓ complete: MANIFEST present, all listed entries on disk
- ⋯ pending: DOWNLOAD_ME present, or interrupted fetch orphans

Fetches are idempotent (re-run is a no-op unless --force) and concurrent-safe (per-slug flock).
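Per-slug locking with flock can be sketched with the standard-library `fcntl` module (POSIX only). The lock-file name and location below are assumptions for illustration; the tool's actual layout is not documented here.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def slug_lock(lock_dir: Path, slug: str) -> Iterator[None]:
    """Hold an exclusive advisory lock for one slug while the body runs."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical lock-file name: one hidden file per slug.
    fd = os.open(lock_dir / f".{slug}.lock", os.O_CREAT | os.O_RDWR, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until any other holder releases
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Two processes fetching the same slug would then serialize on the lock, while fetches of different slugs proceed in parallel.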
npx claudepluginhub phy-ren/datasets-tools --plugin datasets-tools