This skill should be used when the user asks to "test an otter-grader notebook", "validate an instructor notebook", "run otter assign", "check autograder tests", "generate a test report", or mentions QA-ing otter-grader notebooks, running the testing pipeline, or checking for leaked solutions. Triggers after refactoring from nbgrader to otter-grader.
Slash command: /nbgrader-to-otter:testing-otter-grader
<scope>
- Instructor notebook: {name}.ipynb
- validate_structure.py passed (exit 0)
- otter-grader is installed: pip install otter-grader
- Solution cells have outputs. Check with:
python scripts/check_outputs.py {name}.ipynb
If status is "fail", solution cells are missing outputs. Execute the notebook
before proceeding. Without outputs, otter assign generates tests that reference
undefined variables, causing NameError failures in the autograder.
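As a rough illustration of what such an output check could look like: the real check_outputs.py is not shown here, so the function names, the marker test, and the returned fields below are all assumptions. The sketch reads the notebook JSON and reports solution code cells whose outputs list is empty.

```python
import json

# Assumption: a "solution cell" is any code cell whose source mentions one of these markers.
SOLUTION_MARKERS = ("# SOLUTION", "# BEGIN SOLUTION")


def find_solution_cells_without_outputs(notebook: dict) -> list[int]:
    """Return indices of code cells that contain a solution marker but have no outputs."""
    missing = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        has_marker = any(marker in source for marker in SOLUTION_MARKERS)
        if has_marker and not cell.get("outputs"):
            missing.append(index)
    return missing


def check_notebook_file(path: str) -> dict:
    """Load a notebook from disk and summarize the check as a small dict (hypothetical shape)."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    missing = find_solution_cells_without_outputs(notebook)
    return {"status": "fail" if missing else "pass", "missing_output_cells": missing}
```

A cell that only assigns a variable produces no output even after execution, so a check of this kind can over-report; treat its findings as a prompt to execute the notebook, not as proof of a problem.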
Local execution (preferred over JupyterHub):
cd {notebook_dir}
jupyter nbconvert --to notebook --execute --inplace \
--ExecutePreprocessor.allow_errors=True {name}.ipynb
Allowing errors (--ExecutePreprocessor.allow_errors=True above, or the shorter
--allow-errors alias) is essential: grader.check_all() and grader.export()
cells will error because the otter test files don't exist yet, but solution cell outputs still
populate correctly. Run from the notebook's directory so relative paths resolve.
Shared data paths: If notebooks reference absolute paths (e.g., ~/shared/<assignment-name>/),
create symlinks to the downloaded data files:
ln -s /path/to/downloaded/data/<assignment-name> ~/shared/<assignment-name>
Without these, data-loading cells fail and all downstream solution cells produce NameErrors.
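The same symlink can be created from Python when a setup script is more convenient than the shell. This is a minimal sketch; link_shared_data and its argument names are hypothetical and not part of the plugin's scripts.

```python
from pathlib import Path


def link_shared_data(downloaded_dir: Path, expected_path: Path) -> bool:
    """Make expected_path a symlink to downloaded_dir; return True if a link was created."""
    if expected_path.exists() or expected_path.is_symlink():
        # Leave existing files, directories, or links untouched.
        return False
    expected_path.parent.mkdir(parents=True, exist_ok=True)
    expected_path.symlink_to(downloaded_dir.resolve(), target_is_directory=True)
    return True
```

Calling it twice is harmless: the second call sees the existing link and returns False instead of raising.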
After execution, re-run check_outputs.py to confirm solution cells have outputs.
The assignment config sets environment: environment.yml and the file exists with all
required packages. Gradescope builds strictly from this file — packages present in your
local environment but absent here will cause ImportError on Gradescope. Audit all
bundled .py helper files for transitive imports:
grep -h "^import\|^from" *.py | sort -u
Cross-reference every result against environment.yml before proceeding.
- [ ] Stage 1: Run otter assign
- [ ] Stage 2: Validate generated output structure
- [ ] Stage 3: Validate student notebook
- [ ] Stage 3.5: Student notebook coherence (optional)
- [ ] Stage 4: Run autograder tests against solutions
- [ ] Stage 5: Generate error report
Each stage depends on the previous. If a stage fails, continue running subsequent stages to collect maximum diagnostic information. The final report marks the pipeline as failed regardless.
python scripts/run_otter_assign.py {name}.ipynb dist/ > assign.log
Wraps otter assign with a 300-second timeout. Captures exit code, stdout/stderr, and
duration. On failure, check error_patterns against
references/error-patterns.md for the fix mapping.
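A wrapper of this kind can be approximated with subprocess.run. The sketch below is not the plugin's run_otter_assign.py; the function name and the result keys are assumptions chosen to mirror the description above (exit code, stdout/stderr, duration, timeout).

```python
import subprocess
import time


def _as_text(value) -> str:
    """Partial output attached to a timeout error may be bytes or None; normalize to str."""
    if value is None:
        return ""
    return value.decode(errors="replace") if isinstance(value, bytes) else value


def run_with_timeout(command: list[str], timeout_seconds: float = 300) -> dict:
    """Run a command, capturing exit code, output, duration, and whether it timed out."""
    start = time.monotonic()
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
        result = {
            "exit_code": completed.returncode,
            "stdout": completed.stdout,
            "stderr": completed.stderr,
            "timed_out": False,
        }
    except subprocess.TimeoutExpired as error:
        result = {
            "exit_code": None,
            "stdout": _as_text(error.stdout),
            "stderr": _as_text(error.stderr),
            "timed_out": True,
        }
    result["duration_seconds"] = round(time.monotonic() - start, 3)
    return result
```

For example, run_with_timeout(["otter", "assign", "hw01.ipynb", "dist/"]) would apply the 300-second default, where hw01.ipynb is a placeholder notebook name.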
<environment_specific_failures>
If otter assign fails because hidden tests produce AssertionError or NameError on
environment-dependent computations, use --no-run-tests to bypass local test validation:
otter assign --no-run-tests {name}.ipynb dist/
This is safe when environment.yml is properly configured — Gradescope will replicate
the correct environment using that file, so tests that fail locally due to package version
differences will pass on Gradescope. Common local-only failure causes:
- scipy.optimize.linprog using different solvers across platforms (macOS/ARM64 vs Linux)
Only use --no-run-tests when failures are clearly environment-specific, not logic errors.
Verify by checking that the solution code runs and produces reasonable output in the
executed notebook.
</environment_specific_failures>
If successful, dist/ contains autograder/ and student/ directories.
See references/expected-output-structure.md.
python scripts/validate_generated_output.py dist/ --config {name}.ipynb > structure.log
Checks that dist/autograder/ and dist/student/ contain expected contents: notebook,
autograder zip (non-empty, valid), otter_config.json, and companion files. The --config
flag extracts the files: list from the instructor notebook to verify companion files
are present in both directories.
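The checks described above can be sketched with pathlib and zipfile. This is not the plugin's validate_generated_output.py: the function name, the message strings, and the assumption that otter_config.json sits in dist/autograder/ are illustrative only.

```python
import zipfile
from pathlib import Path


def check_dist(dist: Path, notebook_name: str, companion_files=()) -> list[str]:
    """Return a list of human-readable problems; an empty list means the layout looks right."""
    problems = []
    autograder = dist / "autograder"
    student = dist / "student"

    for directory in (autograder, student):
        if not (directory / f"{notebook_name}.ipynb").is_file():
            problems.append(f"missing {notebook_name}.ipynb in {directory.name}/")
        for companion in companion_files:
            if not (directory / companion).exists():
                problems.append(f"missing {companion} in {directory.name}/")

    # Assumption: otter_config.json is expected next to the autograder notebook.
    if not (autograder / "otter_config.json").is_file():
        problems.append("missing otter_config.json in autograder/")

    archives = sorted(autograder.glob(f"{notebook_name}-autograder_*.zip"))
    if not archives:
        problems.append("missing autograder zip in autograder/")
    for archive in archives:
        if not zipfile.is_zipfile(archive):
            problems.append(f"invalid zip: {archive.name}")
            continue
        with zipfile.ZipFile(archive) as handle:
            if not handle.namelist():
                problems.append(f"empty zip: {archive.name}")
    return problems
```

Returning a list of problems instead of raising on the first one matches the pipeline's habit of collecting as much diagnostic information as possible per run.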
Stage 3 (student notebook validation) is handled by generate_report.py (Stage 5) with the
--student-notebook and --instructor-notebook flags. It performs static analysis:
<student_checks>
- Otter initialization cell (import otter; grader = otter.Notebook(...))
- Leftover instructor markers (# SOLUTION, # HIDDEN, # BEGIN TESTS, etc.)
- grader.check_all() and grader.export()
</student_checks>
Leaked solution detection is the most critical check. If a solution line lacks the
# SOLUTION marker in the instructor notebook, otter assign will not strip it from
the student version — giving students the answer.
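One part of this static analysis, the scan for instructor markers that survive into the student notebook, can be sketched as follows. The function name and return shape are hypothetical, and the marker tuple only covers the three markers named above; generate_report.py may check more.

```python
# Markers named in the checks above; the real script may look for additional ones.
INSTRUCTOR_MARKERS = ("# SOLUTION", "# HIDDEN", "# BEGIN TESTS")


def find_leftover_markers(notebook: dict, markers=INSTRUCTOR_MARKERS) -> list[tuple[int, str]]:
    """Return (cell_index, line) pairs for every line that still carries an instructor marker."""
    findings = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            if any(marker in line for marker in markers):
                findings.append((index, line.strip()))
    return findings
```

A marker scan does not catch a solution line that never had a marker. Catching that case means comparing the student notebook against the instructor notebook, which is why both are passed to the report step.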
python3 scripts/eval_student_coherence.py dist/student/{name}.ipynb
The script prints an evaluation prompt to stdout. Read it and evaluate the student notebook from the perspective of the student it describes. You are the judge; no external call is needed. Produce a JSON array of gap findings and write it to coherence.json:
[
{"cell_index": 12, "description": "References 'model' which was never introduced", "severity": "high"},
{"cell_index": 24, "description": "Mentions 'the previous step' but no such step exists", "severity": "medium"}
]
If no gaps found, write []. Pass coherence.json to Stage 5:
python3 scripts/generate_report.py \
--notebook {name}.ipynb \
--assign-log assign.log \
--structure-log structure.log \
--student-notebook dist/student/{name}.ipynb \
--instructor-notebook {name}.ipynb \
--autograder-log autograder.log \
--coherence coherence.json \
--output report.json
Coherence findings appear in report.json under stages.coherence and never change pipeline_status — they are advisory. High-severity gaps should be flagged for human review before distribution.
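Before handing coherence.json to the report step, it can help to confirm the file has the expected shape. The sketch below infers the schema from the two example findings (cell_index, description, severity); the function names are hypothetical, and no particular set of allowed severity values is enforced because only "high" and "medium" appear in the example.

```python
import json

# Field names and types inferred from the example findings; treat them as assumptions.
REQUIRED_FIELDS = {"cell_index": int, "description": str, "severity": str}


def validate_findings(text: str) -> list[dict]:
    """Parse coherence findings and raise ValueError if the structure is not as expected."""
    findings = json.loads(text)
    if not isinstance(findings, list):
        raise ValueError("coherence findings must be a JSON array")
    for position, finding in enumerate(findings):
        if not isinstance(finding, dict):
            raise ValueError(f"finding {position} must be a JSON object")
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(finding.get(field), expected_type):
                raise ValueError(
                    f"finding {position}: '{field}' must be {expected_type.__name__}"
                )
    return findings


def high_severity(findings: list[dict]) -> list[dict]:
    """Findings that should be flagged for human review before distribution."""
    return [finding for finding in findings if finding["severity"] == "high"]
```

An empty array passes validation, which matches the instruction to write [] when no gaps are found.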
python scripts/run_autograder_tests.py dist/autograder/{name}.ipynb \
dist/autograder/{name}-autograder_*.zip > autograder.log
Grades the autograder notebook (which contains solutions) against the autograder zip. Expected result: 100% score on all questions.
<failure_causes> Any failure indicates one of:
otter assign
</failure_causes>
The per-question breakdown identifies exactly which questions fail, with full tracebacks.
python3 scripts/generate_report.py \
--notebook {name}.ipynb \
--assign-log assign.log \
--structure-log structure.log \
--student-notebook dist/student/{name}.ipynb \
--instructor-notebook {name}.ipynb \
--autograder-log autograder.log \
--coherence coherence.json \
--output report.json
Aggregates all stage results, runs student notebook validation (Stage 3), maps errors
to issue types with fix actions, deduplicates, and writes report.json.
See references/report-schema.md for the format.
<reading_the_report>
If pipeline_status is "pass", the notebook is ready for distribution.
If "fail", read summary.fix_actions for a prioritized list of fixes. Each action
identifies the responsible agent and the specific change needed. The stages section
provides per-stage detail. For test failures, stages.autograder_tests.per_question
shows which questions failed with full tracebacks.
</reading_the_report>
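Reading the report can be sketched as below. Only pipeline_status and summary.fix_actions are taken from the description above; the function name is hypothetical, and because the shape of each action is defined in references/report-schema.md (not reproduced here), the sketch treats actions as opaque values and simply numbers them in the order given.

```python
def next_steps(report: dict) -> list[str]:
    """Turn a parsed report.json into a short list of lines to act on."""
    if report.get("pipeline_status") == "pass":
        return ["ready for distribution"]
    actions = report.get("summary", {}).get("fix_actions", [])
    if not actions:
        return ["pipeline failed with no fix actions listed; inspect the stages section"]
    # Assumption: fix_actions is already in priority order, so list position is the rank.
    return [f"{rank}. {action}" for rank, action in enumerate(actions, start=1)]
```

With a report loaded via json.load, printing the returned lines gives a quick triage view before opening the per-stage detail.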
<feedback_loop>
Hand report.json to the Refactoring Agent. It should:
- Read summary.fix_actions
- Using each action's cell_index, apply each fix
- Re-run validate_structure.py
This loop usually converges within 2-3 iterations. If it does not, flag for manual review. </feedback_loop>
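The iteration cap on this loop can be sketched as a small driver. The two callables are placeholders for whatever runs the pipeline and applies fixes; their names, the returned dict, and the "manual_review" status string are assumptions made for illustration.

```python
from typing import Callable


def run_feedback_loop(
    run_pipeline: Callable[[], dict],
    apply_fixes: Callable[[list], None],
    max_iterations: int = 3,
) -> dict:
    """Alternate pipeline runs and fixes until the report passes or the cap is reached."""
    for iteration in range(1, max_iterations + 1):
        report = run_pipeline()
        if report.get("pipeline_status") == "pass":
            return {"status": "pass", "iterations": iteration}
        if iteration < max_iterations:
            # Skip fixing after the final run: those fixes would never be re-tested.
            apply_fixes(report.get("summary", {}).get("fix_actions", []))
    return {"status": "manual_review", "iterations": max_iterations}
```

Stopping at a fixed cap keeps a non-converging notebook from cycling indefinitely and makes the hand-off to a human explicit.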
Standalone notebook executor for re-running notebooks when outputs are missing:
python scripts/run_notebook.py {name}.ipynb
Copies the notebook to a temp directory with companion files, executes via
jupyter nbconvert, and returns per-cell execution results as JSON.
Does not modify the original.
<common_failures> These failures recur across notebooks:
NameError in autograder tests. Solution cells have no outputs. The notebook was
not executed before otter assign. Fix: execute locally with
jupyter nbconvert --to notebook --execute --inplace --allow-errors {name}.ipynb, then re-run otter assign.
FileNotFoundError: environment.yml. Assignment config references environment.yml
but the file doesn't exist in the notebook directory. Fix: create the file with all
required packages (scan notebook imports AND all bundled .py helper files for
transitive imports). Do NOT remove the environment: line — Gradescope needs it to
replicate the execution environment.
Transitive import from helper module missing in environment.yml. import helper_functions succeeds locally because your local environment (conda base, Docker
image, or system Python) may have packages installed that are not in environment.yml.
On Gradescope, the environment is built strictly from environment.yml — nothing else.
If helper_functions.py imports seaborn and seaborn is not in environment.yml,
the import fails, and because import os, pathlib, and the path setup are in the SAME
cell, they never run. The symptom is NameError: path_data not defined, which looks
like a data-loading bug, not a missing package. Diagnosis: check every bundled .py
file's imports against environment.yml before assuming anything else is wrong:
grep -h "^import\|^from" *.py | sort -u
Test references variable outside question block. Tests depend on data loaded in
cells before # BEGIN QUESTION. Fix: add data setup inside the test cell or use
try/except (wrap the assertion in try/except NameError: pass so a missing variable
does not fail the whole test suite).
Provided computation cell placed after # END QUESTION. Pattern: student solves
means and standard_deviations; a non-solution cell computes standardized_table
from those; the test checks standardized_table. During conversion the computation
cell lands AFTER END QUESTION, so it hasn't run when the test fires. Fix: move
the provided computation cell to between # END SOLUTION and # BEGIN TESTS.
"Public Tests" phantom failure (0/1 points). Otter-grader artifact from
grader.check_all(). No fix needed — this is expected behavior. All actual per-question
tests pass; only the synthetic "Public Tests" summary entry shows 0/1.
FileNotFoundError on shared data paths. Notebooks reference absolute paths like
~/shared/<assignment-name>/ that only exist on JupyterHub. Fix: create symlinks
from the expected path to the local downloaded data directory.
AssertionError in hidden tests (environment-specific). scipy, numpy, or random
seed differences between local machine and Gradescope produce slightly different
numerical results. Fix: use otter assign --no-run-tests locally. With a proper
environment.yml in the zip, Gradescope replicates the correct environment and
these tests will pass there.
</common_failures>
- dist/ contents
- report.json
npx claudepluginhub nyu-tandon-tmi/nbgrader-to-otter-grader --plugin nbgrader-to-otter