Semantic code search for Claude Code using Qdrant and local embeddings
npx claudepluginhub adilasif/local-semantic-search-claude-code-plugin
A Claude Code plugin for GPU-accelerated semantic code search. Self-hosted embedding, reranking, and vector search — accessible locally or remotely over Tailscale.
Three containerized services work together:
Embedding Service: Jina Code Embeddings 0.5B running on HuggingFace's Text Embeddings Inference (TEI), with custom builds for SM 12.0 (Blackwell) and CUDA 13.1. The build includes performance-optimized Flash Attention and link-time optimization (LTO).
Reranking Proxy: Jina Reranker v3 behind a proxy layer that intercepts Qdrant search requests. The proxy fetches the top 100 vector results, reranks them with the cross-encoder, and returns the top N. It uses TorchAO int4 quantization, Flash Attention 2, and a listwise reranking architecture for throughput.
Vector Database: Qdrant for persistent vector storage with incremental indexing support.
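The over-fetch-then-rerank flow described for the reranking proxy can be sketched as follows. This is a minimal illustration, not the plugin's actual code: `search_with_rerank`, `fake_vector_search`, and `fake_rerank` are hypothetical names, and the stand-in scorer uses simple token overlap where the real service would call Jina Reranker v3.

```python
from typing import Callable, Dict, List, Sequence


def search_with_rerank(
    query: str,
    vector_search: Callable[[str, int], Sequence[Dict]],
    rerank_scores: Callable[[str, Sequence[str]], Sequence[float]],
    top_n: int = 10,
    candidate_limit: int = 100,  # the "top 100" over-fetch from the description
) -> List[Dict]:
    """Over-fetch candidates by vector similarity, then reorder them by reranker score."""
    candidates = list(vector_search(query, candidate_limit))
    if not candidates:
        return []
    scores = rerank_scores(query, [c["text"] for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    # Attach the score so callers can see why a result was promoted.
    return [dict(c, rerank_score=s) for c, s in ranked[:top_n]]


# Hypothetical stand-ins for the vector database and the reranker model.
def fake_vector_search(query: str, limit: int) -> List[Dict]:
    docs = ["def parse_config(path)", "class HttpClient", "def load_config_file(path)"]
    return [{"id": i, "text": t} for i, t in enumerate(docs)][:limit]


def fake_rerank(query: str, texts: Sequence[str]) -> List[float]:
    tokens = set(query.lower().split())
    return [float(sum(tok in t.lower() for tok in tokens)) for t in texts]


results = search_with_rerank("load config", fake_vector_search, fake_rerank, top_n=2)
print([r["id"] for r in results])
```

Fetching more candidates than the caller asked for gives the reranker room to promote results that vector similarity alone ranked low.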
- cpuset: embedding and reranking are isolated to separate core groups to prevent contention
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for efficient CUDA memory allocation
- Unlimited locked memory (ulimits.memlock: -1) and a 64MB stack for GPU workloads
- pid: host and ipc: host for shared memory access between GPU processes
- OMP_NUM_THREADS and MKL_NUM_THREADS pinned to match the cpuset allocation
- restart: unless-stopped for automatic recovery from CUDA context corruption under WSL2

docker compose -f docker-compose.semantic-search.yaml --profile semantic-search up -d
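To keep OMP_NUM_THREADS and MKL_NUM_THREADS in step with a cpuset value, one option is to derive the thread count from the cpuset string itself. The helper below is a hypothetical sketch, assuming the usual Docker cpuset syntax of comma-separated CPU IDs and inclusive ranges. The plugin's compose file may simply hard-code these values.

```python
from typing import Dict


def count_cpus(cpuset: str) -> int:
    """Count the CPUs named by a Docker-style cpuset string such as "0-3,8-11"."""
    cpus = set()
    for part in cpuset.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if end < start:
                raise ValueError(f"invalid range in cpuset: {part!r}")
            cpus.update(range(start, end + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return len(cpus)


def thread_env(cpuset: str) -> Dict[str, str]:
    """Build thread-count variables that match the number of CPUs in the cpuset."""
    n = str(count_cpus(cpuset))
    return {"OMP_NUM_THREADS": n, "MKL_NUM_THREADS": n}


print(thread_env("0-7"))
```

Matching the thread count to the pinned cores avoids oversubscription, where more worker threads than available cores compete for the same CPUs.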
The plugin registers as a Claude Code MCP server, providing tools for semantic search, indexing, and collection management.
Install Tailscale on the GPU host and any remote machines. Then set environment variables on remote machines:
export QDRANT_URL="http://<tailscale-hostname>:6333"
export EMBEDDING_URL="http://<tailscale-hostname>:1335"
The plugin reads these at startup, falling back to localhost when unset.
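The fallback behavior can be pictured with a short sketch. It assumes the localhost defaults use the same ports as the remote example (6333 for Qdrant, 1335 for embeddings). `resolve_endpoints` and the `gpu-box` hostname are hypothetical and are not taken from the plugin source.

```python
import os
from typing import Dict, Mapping

# Assumed defaults: same ports as the remote example, served from localhost.
DEFAULT_QDRANT_URL = "http://localhost:6333"
DEFAULT_EMBEDDING_URL = "http://localhost:1335"


def resolve_endpoints(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read service URLs from the environment, falling back to localhost when unset or empty."""
    qdrant = env.get("QDRANT_URL") or DEFAULT_QDRANT_URL
    embedding = env.get("EMBEDDING_URL") or DEFAULT_EMBEDDING_URL
    # Strip trailing slashes so later path joins do not produce "//".
    return {
        "qdrant_url": qdrant.rstrip("/"),
        "embedding_url": embedding.rstrip("/"),
    }


local = resolve_endpoints({})
remote = resolve_endpoints({"QDRANT_URL": "http://gpu-box:6333/"})
print(local, remote)
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.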
| Component | VRAM |
|---|---|
| Jina Code Embeddings 0.5B (float16) | ~1 GB |
| Jina Reranker v3 (int4 quantized) | ~6 GB |
| Total | ~7-8 GB |
On constrained hardware, batch sizes (and with them peak VRAM use) can be reduced via the MAX_CLIENT_BATCH_SIZE and MAX_BATCH_TOKENS environment variables.