local-semantic-search

Name: local-semantic-search
Author: adilasif

Semantic code search using Qdrant and local embeddings

npx claudepluginhub adilasif/local-semantic-search-claude-code-plugin --plugin local-semantic-search

What's Inside

Skills (3)

codebase-onboard

/codebase-onboard

This skill should be used when the user is new to a codebase and wants to understand its structure, architecture, or main components. Trigger phrases include "help me understand this codebase", "I'm new to this project", "give me an overview", "what are the main components", "how is this structured", or "onboard me". Use for initial exploration, not for specific questions.

index-manage

/index-manage

This skill should be used when the user wants to create, update, delete, or troubleshoot semantic search indexes. Trigger phrases include "index this codebase", "update the index", "refresh the index", "delete the collection", "check index status", "semantic search isn't working", or "why isn't it finding". Use for index administration, not for searching.

semantic-explore

/semantic-explore

This skill should be used when the user asks to understand code behavior, explore how something works, find where functionality is implemented, or asks questions like "how does X work", "where is Y handled", "what does Z do", "show me the code that", "find the implementation of", or "trace the flow of". Also use for exploring unfamiliar codebases or when grep/glob returns too many irrelevant results. Use BEFORE or ALONGSIDE keyword search.

Hooks (1)

Event Hooks

2 hooks across 2 events

MCP Servers (1)

local-semantic-search

admin

Stats

Version: 0.2.0

Language: Python

Stars: 0

Maintenance: Fair

Last Commit: Mar 2, 2026

Added: Mar 22, 2026

Safety Signals

Critical

Admin access level

Server config contains admin-level keywords

Local Semantic Search

A Claude Code plugin for GPU-accelerated semantic code search. Embedding, reranking, and vector search are all self-hosted and can be reached locally or remotely over Tailscale.

Architecture

Three containerized services work together:

Embedding Service: Jina Code Embeddings 0.5B running on Hugging Face's Text Embeddings Inference (TEI), custom-built for SM 12.0 (Blackwell) and CUDA 13.1. The build includes performance-optimized Flash Attention and link-time optimization (LTO).

Reranking Proxy: Jina Reranker v3 behind a proxy layer that intercepts Qdrant search requests. The proxy fetches the top 100 vector results, reranks them with the cross-encoder, and returns the top N. It uses TorchAO int4 quantization, Flash Attention 2, and a listwise reranking architecture for throughput.

Vector Database: Qdrant for persistent vector storage with incremental indexing support.
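
The retrieve-then-rerank flow described for the reranking proxy can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: `search_with_rerank`, `vector_search`, and `rerank` are hypothetical names, and the two callables stand in for the Qdrant query and the reranker call.

```python
from typing import Callable, Sequence

Hit = tuple[str, float]  # (passage text, score)


def search_with_rerank(
    query: str,
    vector_search: Callable[[str, int], list[Hit]],
    rerank: Callable[[str, Sequence[str]], list[float]],
    top_n: int = 10,
    candidates: int = 100,
) -> list[Hit]:
    """Over-fetch by vector similarity, then reorder with a reranker."""
    # Step 1: fetch more candidates than the caller asked for (100 in the README).
    hits = vector_search(query, candidates)
    if not hits:
        return []
    # Step 2: score every candidate against the query text.
    scores = rerank(query, [text for text, _ in hits])
    # Step 3: replace vector scores with reranker scores and keep the best top_n.
    rescored = [(text, score) for (text, _), score in zip(hits, scores)]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored[:top_n]
```

Over-fetching gives the reranker room to promote passages that vector similarity alone ranked low.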

Optimizations

GPU & Inference

Custom TEI builds with patched Candle, Candle-Extensions, and candle-index-select-cu for SM 12.0 (Blackwell) kernel support and CUDA 13.1
Performance-optimized Flash Attention with link-time optimization (LTO) and CPU-side optimizations
Flash Attention 2 with custom FBGEMM tuning for the reranker
TorchAO Int4 weight-only quantization (group_size=128) on the reranker — cuts VRAM without quality loss
float16 dtype for embeddings, bfloat16 for reranking
Parallel tokenization (4 workers) on the embedding model for preprocessing throughput
Configurable batch sizes and token limits tuned per available VRAM
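
To make the group_size=128 setting concrete, here is a generic group-wise affine 4-bit quantizer in plain Python. It is only a sketch of the general idea: TorchAO's actual kernels, rounding, and packed storage layout differ, and the function names are hypothetical.

```python
def quantize_int4_groupwise(weights, group_size=128):
    """Return (codes, scales, mins): one 4-bit code per weight, one scale and min per group."""
    codes, scales, mins = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # 4 bits give 16 levels (0..15); fall back to 1.0 for a constant group.
        scale = (hi - lo) / 15 or 1.0
        scales.append(scale)
        mins.append(lo)
        codes.extend(min(15, max(0, round((w - lo) / scale))) for w in group)
    return codes, scales, mins


def dequantize(codes, scales, mins, group_size=128):
    """Rebuild approximate weights from the codes and per-group metadata."""
    return [
        code * scales[i // group_size] + mins[i // group_size]
        for i, code in enumerate(codes)
    ]
```

Each weight is stored in 4 bits instead of 16, so the weights take roughly a quarter of their 16-bit footprint plus the per-group scale and minimum. Smaller groups track the local weight range more closely at the cost of more metadata.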

Container Tuning

CPU pinning via cpuset — embedding and reranking isolated to separate core groups to prevent contention
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for efficient CUDA memory allocation
Unlimited memlock (ulimits.memlock: -1) and 64MB stack for GPU workloads
pid: host and ipc: host for shared memory access between GPU processes
OMP_NUM_THREADS and MKL_NUM_THREADS pinned to match cpuset allocation
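
One way to keep the thread-count variables in step with a cpuset is to derive them from the cpuset string. The helper below is a hypothetical sketch (`cpuset_size` and `thread_env` are not part of the plugin) and assumes the usual "0-3,8" list-and-range cpuset syntax.

```python
def cpuset_size(cpuset: str) -> int:
    """Count the CPUs named by a cpuset string such as "0-3,8"."""
    cpus: set[int] = set()
    for part in cpuset.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return len(cpus)


def thread_env(cpuset: str) -> dict[str, str]:
    """Build thread-count variables that match the number of pinned cores."""
    count = str(cpuset_size(cpuset))
    return {"OMP_NUM_THREADS": count, "MKL_NUM_THREADS": count}
```

For example, `thread_env("4-7")` yields a value of "4" for both variables, so the math libraries do not spawn more threads than the container has cores.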

Reranking Pipeline

Query-passage asymmetric prefixing for better retrieval quality
LRU query cache (1000 entries, 60s TTL) correlates embedding vectors back to query text for reranking
Listwise batched reranking (batch size 64) with score threshold filtering
Selective filtering: returns few or no results when relevance is low, prompting query refinement instead of polluting context with marginal matches
Graceful fallback to vector-only results if reranking fails
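
The query cache, threshold filter, and fallback listed above could look roughly like this. It is a sketch under assumptions, not the plugin's implementation: the class and function names are hypothetical, and keying the cache on the rounded vector is one guess at how a vector might be matched back to its query text.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Sequence


class QueryCache:
    """LRU cache with a per-entry TTL, mapping an embedding vector to its query text."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()  # key -> (expires_at, text)

    @staticmethod
    def _key(vector: Sequence[float]) -> tuple:
        # Assumption: rounding makes the key robust to tiny float differences.
        return tuple(round(x, 6) for x in vector)

    def put(self, vector: Sequence[float], text: str) -> None:
        key = self._key(vector)
        self._entries[key] = (self._clock() + self._ttl, text)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def get(self, vector: Sequence[float]) -> Optional[str]:
        key = self._key(vector)
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, text = item
        if self._clock() >= expires_at:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return text


def apply_threshold(query, vector_hits, rerank, threshold):
    """Keep hits whose reranker score meets the threshold; on failure return vector hits."""
    try:
        scores = rerank(query, [text for text, _ in vector_hits])
    except Exception:
        return list(vector_hits)  # graceful fallback to vector-only results
    kept = [(text, s) for (text, _), s in zip(vector_hits, scores) if s >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)
```

Note that `apply_threshold` can legitimately return an empty list, which matches the selective-filtering behaviour described above.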

Indexing

Tree-sitter AST-aware chunking for Python, TypeScript, JavaScript, Go, Rust, Java, C/C++, Ruby, and more
Line-based fallback for unsupported languages (YAML, Markdown, SQL, etc.)
Deterministic UUID v5 point IDs for idempotent upserts
Incremental indexing via file hash comparison — only re-indexes changed files
Concurrent file processing (10 files) with batched embedding (60 segments per batch)
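
Deterministic point IDs and hash-based change detection can be illustrated with the standard library alone. The sketch below is an assumption about how such a scheme might be built: the namespace URL, the `path:start-end` naming, and the choice of SHA-256 are all hypothetical and not taken from the plugin.

```python
import hashlib
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/local-semantic-search")


def point_id(path: str, start_line: int, end_line: int) -> str:
    """Derive the same UUID v5 for the same chunk, so re-upserts overwrite in place."""
    return str(uuid.uuid5(NAMESPACE, f"{path}:{start_line}-{end_line}"))


def file_hash(content: bytes) -> str:
    """Content hash used to decide whether a file needs re-indexing."""
    return hashlib.sha256(content).hexdigest()


def changed_files(current: dict[str, bytes], indexed_hashes: dict[str, str]) -> list[str]:
    """Return paths that are new or whose content hash differs from the stored one."""
    return sorted(
        path for path, data in current.items()
        if indexed_hashes.get(path) != file_hash(data)
    )
```

Because the ID depends only on the chunk's identity, indexing the same file twice produces the same points rather than duplicates.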

Reliability

Health checks that run actual inference requests (not just HTTP pings) every 30 seconds
restart: unless-stopped for automatic recovery from CUDA context corruption under WSL2
Health-gated service dependencies — the reranker waits for the embedding model to pass inference health checks before starting
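
An inference-level health check, as opposed to an HTTP ping, might be sketched like this. It assumes the embedding service exposes TEI's `/embed` route, which accepts a JSON body with an `inputs` field and returns a list of embedding vectors; the function names and the probe text are hypothetical.

```python
import json
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict, timeout: float = 5.0):
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


def embedding_healthy(base_url: str,
                      post: Callable[[str, dict], object] = post_json) -> bool:
    """Healthy only if a real embedding request returns one non-empty vector."""
    try:
        result = post(f"{base_url}/embed", {"inputs": "health check"})
    except Exception:
        return False  # connection errors, timeouts, and bad JSON all count as unhealthy
    return (
        isinstance(result, list)
        and len(result) == 1
        and isinstance(result[0], list)
        and len(result[0]) > 0
    )
```

A check like this catches the case where the HTTP server is up but the model or CUDA context is broken, which a plain ping would miss.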

Requirements

Docker with NVIDIA Container Toolkit
NVIDIA Blackwell GPU (SM 12.0) with 16GB+ VRAM
CUDA 13.1, drivers >= 590

Quick Start

docker compose -f docker-compose.semantic-search.yaml --profile semantic-search up -d

The plugin registers as a Claude Code MCP server, providing tools for semantic search, indexing, and collection management.

Remote Access

Install Tailscale on the GPU host and any remote machines. Then set environment variables on remote machines:

export QDRANT_URL="http://<tailscale-hostname>:6333"
export EMBEDDING_URL="http://<tailscale-hostname>:1335"

The plugin reads these at startup, falling back to localhost when unset.
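
The localhost fallback could be expressed as below. This is a sketch rather than the plugin's code: `load_endpoints` is a hypothetical name, and the default ports are assumed to match the ones in the export lines above.

```python
import os
from collections.abc import Mapping

# Assumed defaults: same ports as the remote examples, on localhost.
DEFAULTS = {
    "QDRANT_URL": "http://localhost:6333",
    "EMBEDDING_URL": "http://localhost:1335",
}


def load_endpoints(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read service URLs from the environment, falling back to localhost when unset or empty."""
    return {
        name: (env.get(name) or default).rstrip("/")
        for name, default in DEFAULTS.items()
    }
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.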

VRAM Usage

Component	VRAM
Jina Code Embeddings 0.5B (float16)	~1 GB
Jina Reranker v3 (int4 quantized)	~6 GB
Total	~7-8 GB

On constrained hardware, VRAM usage can be reduced via the MAX_CLIENT_BATCH_SIZE and MAX_BATCH_TOKENS environment variables.
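
As a rough sanity check on the embedding row of the table, the weight footprint follows from parameter count times bits per weight. The helper is a hypothetical sketch, and it covers weights only; activations, attention buffers, and the CUDA context add to the total, so the other rows cannot be derived this way.

```python
def weight_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return parameters * bits_per_weight / 8 / 1e9


# 0.5B parameters stored as float16 (16 bits each).
embedding_gb = weight_gb(0.5e9, 16)  # 1.0
```

The result, 1.0 GB, is consistent with the ~1 GB listed for the float16 embedding model.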

Similar Plugins

lumen

beacon

cocoindex-code

colgrep

claude-turbo-search

vexor

More by adilasif

ooda
