Skill

ai-multimodal

Processes audio, images, videos, and PDFs, and generates images/videos using Google Gemini, Imagen, and Veo models. Useful for transcription, OCR, visual Q&A, document extraction, and media generation.

Python

OpenAI

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/bmad-skills:ai-multimodal

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Bash, Read, Write, Edit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Process audio, images, videos, documents, and generate images/videos using Google Gemini's multimodal API.

Supporting Files

references/audio-processing.md
references/image-generation.md
references/video-analysis.md
references/video-generation.md
references/vision-understanding.md
scripts/check_setup.py
scripts/document_converter.py
scripts/gemini_batch_process.py
scripts/media_optimizer.py
scripts/requirements.txt
scripts/tests/requirements.txt
scripts/tests/test_document_converter.py
scripts/tests/test_gemini_batch_process.py
scripts/tests/test_media_optimizer.py

SKILL.md

70 lines · ~1.5k tokens

Stats

Language: HTML

Stars: 11

Forks: 2

Maintenance: Excellent

Last Commit: Jun 18, 2026

AI Multimodal

Process audio, images, videos, documents, and generate images/videos using Google Gemini's multimodal API.

Setup

export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
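Once the key is exported, a script can read it from the environment before creating a client. The following is a minimal sketch, not the logic in check_setup.py: the helper names resolve_api_key and make_client are hypothetical, and only the GEMINI_API_KEY variable and the google-genai package come from the setup above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GEMINI_API_KEY from the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the scripts"
        )
    return key


def make_client(env: Optional[Mapping[str, str]] = None):
    """Hypothetical helper: build a google-genai client from the resolved key."""
    # Imported lazily so the key check works even if google-genai is missing.
    from google import genai

    return genai.Client(api_key=resolve_api_key(env))
```

Keeping the key lookup separate from client creation means a missing key surfaces as a clear error before any network call is attempted.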

Quick Start

Verify setup: python scripts/check_setup.py
Analyze media: python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

TIP: When asked to analyze an image, first check whether the gemini command is available. If it is, run echo "<prompt to analyze image>" | gemini -y -m gemini-2.5-flash. If it is not, fall back to python scripts/gemini_batch_process.py --files <file> --task analyze.

Generate content: python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"
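The fallback described in the tip can be sketched in Python roughly as follows. This is an illustration under assumptions: build_analyze_command and analyze_command_for_this_machine are hypothetical names, the prompt is assumed to reach the gemini CLI over stdin, and passing --prompt to the fallback script is inferred from the stdin example in this document rather than confirmed for every task.

```python
import shutil
from typing import List, Optional, Tuple


def build_analyze_command(
    prompt: str, file_path: str, gemini_path: Optional[str]
) -> Tuple[List[str], Optional[str]]:
    """Return (argv, stdin_text) for analyzing an image.

    If the gemini CLI was found, the prompt is sent on stdin; otherwise the
    bundled batch script is used and nothing is sent on stdin.
    """
    if gemini_path:
        return [gemini_path, "-y", "-m", "gemini-2.5-flash"], prompt
    argv = [
        "python", "scripts/gemini_batch_process.py",
        "--files", file_path,
        "--task", "analyze",
        "--prompt", prompt,
    ]
    return argv, None


def analyze_command_for_this_machine(prompt: str, file_path: str):
    """Look up the gemini CLI on PATH and build the matching command."""
    return build_analyze_command(prompt, file_path, shutil.which("gemini"))

# Possible usage (runs the chosen command):
#   import subprocess
#   argv, stdin_text = analyze_command_for_this_machine("Describe this", "image.png")
#   subprocess.run(argv, input=stdin_text, text=True, check=True)
```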

Stdin support: You can pipe files directly via stdin (auto-detects PNG/JPG/PDF/WAV/MP3).

cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"

python scripts/gemini_batch_process.py --files image.png --task analyze (traditional)
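Auto-detection of piped input is typically done by inspecting the first bytes of the stream. The sketch below shows one way to do that for the five formats listed above; it is an assumption about the approach, not the script's actual code, and detect_mime_type is a hypothetical name.

```python
from typing import Optional


def detect_mime_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from leading magic bytes, or return None if unknown."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # WAV is a RIFF container whose form type (bytes 8-11) is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    # MP3 starts with an ID3 tag or an MPEG frame sync (11 set bits).
    if data.startswith(b"ID3"):
        return "audio/mpeg"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None

# Possible usage with piped input:
#   import sys
#   payload = sys.stdin.buffer.read()
#   mime = detect_mime_type(payload)
```

JPEG is checked before the MP3 frame-sync test so that the shared leading 0xFF byte cannot cause a misclassification.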

Models

Image generation: imagen-4.0-generate-001 (standard), imagen-4.0-ultra-generate-001 (quality), imagen-4.0-fast-generate-001 (speed)
Video generation: veo-3.1-generate-preview (8s clips with audio)
Analysis: gemini-2.5-flash (recommended), gemini-2.5-pro (advanced)
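One way to picture per-task model defaults is a simple lookup table built from the model IDs listed above. The mapping below is an assumption for illustration: which model gemini_batch_process.py actually picks for each task is not stated here, and DEFAULT_MODELS and default_model_for are hypothetical names.

```python
# Assumed defaults: the recommended analysis model for all understanding
# tasks, the standard Imagen model for images, and Veo for video.
DEFAULT_MODELS = {
    "analyze": "gemini-2.5-flash",
    "transcribe": "gemini-2.5-flash",
    "extract": "gemini-2.5-flash",
    "generate": "imagen-4.0-generate-001",
    "generate-video": "veo-3.1-generate-preview",
}


def default_model_for(task: str) -> str:
    """Return the assumed default model for a task, or raise on unknown tasks."""
    try:
        return DEFAULT_MODELS[task]
    except KeyError:
        known = ", ".join(sorted(DEFAULT_MODELS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```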

Scripts

gemini_batch_process.py: CLI orchestrator for transcribe|analyze|extract|generate|generate-video that auto-resolves API keys, picks sensible default models per task, chooses between inline upload and the File API for each file, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 + Veo workflows.
media_optimizer.py: ffmpeg/Pillow-based preflight tool that compresses/resizes/converts audio, image, and video inputs, enforces target sizes/bitrates, splits long clips into hour-long chunks, and batch-processes directories so media stays within Gemini limits.
document_converter.py: Gemini-powered converter that uploads PDFs/images/Office docs, applies a markdown-preserving prompt, batches multiple files, auto-names outputs under docs/assets, and exposes CLI flags for model, prompt, auto-file naming, and verbose logging.
check_setup.py: Interactive readiness checker that verifies directory layout, centralized env resolver, required Python deps, and GEMINI_API_KEY availability/format, then performs a live Gemini API call and prints remediation instructions if anything fails.

Use --help for options.
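To illustrate the hour-based splitting that media_optimizer.py is described as doing, the sketch below computes chunk boundaries for a clip of known duration. It only does the arithmetic; the actual script's ffmpeg invocation is not shown in this document, and chunk_ranges is a hypothetical name.

```python
from typing import List, Tuple


def chunk_ranges(duration_s: float, chunk_s: float = 3600.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) ranges in seconds."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    duration_s = float(duration_s)
    ranges = []
    start = 0.0
    while start < duration_s:
        # The last chunk is shortened so it never runs past the clip's end.
        end = min(start + chunk_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges
```

Each (start, end) pair could then be passed to a cutting tool as a seek position and duration.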

References

Load for detailed guidance:

Topic	File	Description
Audio	`references/audio-processing.md`	Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes.
Images	`references/vision-understanding.md`	Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases.
Image Gen	`references/image-generation.md`	Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios.
Video	`references/video-analysis.md`	Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns.
Video Gen	`references/video-generation.md`	Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates.

Limits

Formats: Audio (WAV/MP3/AAC, up to 9.5h), Images (PNG/JPEG/WEBP, up to 3.6k images), Video (MP4/MOV, up to 6h), PDF (up to 1k pages)
Size: 20MB inline, 2GB File API
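The size limits suggest a simple rule for choosing between inline upload and the File API. The following sketch makes two assumptions that are not stated above: that MB and GB are binary units, and that the 20MB figure can be compared against the file size alone, ignoring the rest of the request. choose_upload_method is a hypothetical name.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20MB, assumed binary megabytes
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # 2GB, assumed binary gigabytes


def choose_upload_method(size_bytes: int) -> str:
    """Return "inline" or "file_api" for a file of the given size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    raise ValueError("file exceeds the 2GB File API limit; compress or split it first")

# Possible usage:
#   import os
#   method = choose_upload_method(os.path.getsize("meeting.mp3"))
```

Files that exceed the File API limit are candidates for media_optimizer.py before upload.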
