Skill

content-hash-cache-pattern

Caches expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashing instead of file paths — cache survives renames and auto-invalidates on content changes.

Python

developer-tools

performance


Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/everything-claude-code:content-hash-cache-pattern

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

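Caches expensive file processing results (PDF parsing, text extraction, image analysis) using a SHA-256 content hash as the cache key. Unlike path-based caching, this approach keeps working after a file is moved or renamed and invalidates automatically when the content changes.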

SKILL.md

162 lines · ~925 tokens

Stats

Language: JavaScript

Stars: 16

Forks: 6

Maintenance: Excellent

Last Commit: May 31, 2026


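Content-Hash File Cache Pattern

When to Activate

Building file processing pipelines (PDF, images, text extraction)
Processing is expensive and the same files are processed repeatedly
A --cache/--no-cache CLI option is needed
You want to add caching to existing pure functions without modifying them

Core Pattern

1. Content-Hash-Based Cache Key

Use the file's content, not its path, as the cache key: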

import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # read large files in 64 KB chunks

def compute_file_hash(path: Path) -> str:
    """Return the SHA-256 hash of the file's content, reading in chunks."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()

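Why content hashing? A renamed or moved file is still a cache hit, a content change invalidates the entry automatically, and no index file is needed.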
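To make the rename and invalidation behavior concrete, the following is a minimal sketch that hashes a temporary file, renames it, and then changes its content. The file names and the `demo_rename_and_edit` helper are invented for illustration; `compute_file_hash` is repeated in a compact form so the block runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64 KB chunks


def compute_file_hash(path: Path) -> str:
    """SHA-256 of the file's bytes, read chunk by chunk."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(_HASH_CHUNK_SIZE), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def demo_rename_and_edit() -> tuple[str, str, str]:
    """Hypothetical demo: hashes for (original, after rename, after edit)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "report.pdf"  # illustrative file name
        original.write_bytes(b"same bytes")
        before = compute_file_hash(original)

        # Renaming changes the path but not the content.
        renamed = original.rename(Path(tmp) / "renamed.pdf")
        after_rename = compute_file_hash(renamed)

        # Changing the content changes the hash, so the old entry is never hit.
        renamed.write_bytes(b"different bytes")
        after_edit = compute_file_hash(renamed)
    return before, after_rename, after_edit
```

The first two hashes should match (the rename is still a cache hit), while the third should differ (the edit produces a new key).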

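2. Frozen Dataclass as the Cache Entry

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)  # slots=True requires Python 3.10+
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # the cached result (the type returned by extract_text)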

3. File-Based Cache Storage

Each cache entry is stored as {hash}.json, so a lookup by hash is a single O(1) path check and no index file is needed.

import json
from typing import Any

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # treat a corrupted entry as a cache miss

4. Service-Layer Wrapper (SRP)

Keep the processing functions pure. Add caching as a separate service layer.

import logging

logger = logging.getLogger(__name__)

def extract_with_cache(
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: check cache -> extract -> write cache."""
    if not cache_enabled:
        return extract_text(file_path)  # pure function, unaware of caching

    file_hash = compute_file_hash(file_path)

    # Check the cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
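The `cache_enabled` and `cache_dir` parameters map naturally onto a `--cache/--no-cache` CLI option. A minimal sketch using the standard library's `argparse.BooleanOptionalAction` (Python 3.9+) might look like this; the `build_parser` name, the `file` positional, and the `--cache-dir` flag are assumptions for illustration, not part of the source.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser exposing the wrapper's caching switches."""
    parser = argparse.ArgumentParser(description="Extract text with optional caching")
    parser.add_argument("file", type=Path, help="file to process")
    # BooleanOptionalAction generates both --cache and --no-cache.
    parser.add_argument(
        "--cache",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="reuse cached results keyed by content hash",
    )
    parser.add_argument(
        "--cache-dir",
        type=Path,
        default=Path(".cache"),
        help="directory holding the {hash}.json entries",
    )
    return parser
```

With this parser, `args.cache` and `args.cache_dir` could be passed straight through as `cache_enabled` and `cache_dir` to `extract_with_cache`, leaving `extract_text` untouched.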

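Key Design Decisions

Decision	Rationale
SHA-256 content hash	Path-independent; invalidates automatically when content changes
`{hash}.json` file naming	O(1) lookup, no index file needed
Service-layer wrapper	SRP: extraction stays pure, caching is a separate concern
Manual JSON serialization	Full control over how frozen dataclasses are serialized
Corruption returns `None`	Graceful degradation; the file is reprocessed on the next run
`cache_dir.mkdir(parents=True)`	Cache directory is created lazily on first write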

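Best Practices

Hash content, not paths: paths change, content identity does not
Read large files in chunks when hashing: this avoids loading the whole file into memory
Keep processing functions pure: they should not know the cache exists
Log cache hits and misses with a truncated hash for debugging
Handle corruption gracefully: treat invalid cache entries as misses and never crash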

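Anti-Patterns to Avoid

# Bad: path-based cache (breaks when the file is moved or renamed)
cache = {"/path/to/file.pdf": result}

# Bad: caching logic inside the processing function (SRP violation)
def extract_text(path, *, cache_enabled=False, cache_dir=None):
    if cache_enabled:  # this function now has two responsibilities
        ...

# Bad: dataclasses.asdict() on nested frozen dataclasses
# (can cause problems with complex nested types)
data = dataclasses.asdict(entry)  # use manual serialization instead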

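When to Use

File processing pipelines (PDF parsing, OCR, text extraction, image analysis)
CLI tools that benefit from a --cache/--no-cache option
Batch processing where the same files recur across runs
Adding caching to existing pure functions without modifying them

When Not to Use

Data that must always be fresh (real-time data streams)
Cases where cache entries would be extremely large (consider streaming instead)
Results that depend on parameters beyond the file content (for example, different extraction configs)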
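The last limitation arises because the cache key covers only the file's bytes. One possible workaround, which is not part of the pattern as described here, is to fold the extraction settings into the key. The sketch below assumes a hypothetical `compute_cache_key` helper and a JSON-serializable `config` dict.

```python
import hashlib
import json
from typing import Any


def compute_cache_key(file_hash: str, config: dict[str, Any]) -> str:
    """Hypothetical helper: derive a key from the content hash plus the config.

    The config is dumped as canonical JSON (sorted keys, fixed separators) so
    that dicts with the same items in a different order produce the same key.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    combined = f"{file_hash}:{canonical}".encode("utf-8")
    return hashlib.sha256(combined).hexdigest()
```

Under this assumption, the same file processed with two different configs lands in two separate cache entries, at the cost of the key no longer being the plain content hash.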
