From claudio-plugin
Troubleshoot and analyze logs from AWS CloudWatch Logs. This skill should be used when the user asks to investigate logs, troubleshoot application issues, query log groups, analyze error patterns, or perform log analysis for machines writing to AWS CloudWatch. Uses the AWS CLI for CloudWatch Logs operations.
Slash command: /claudio-plugin:aws-log-analyzer
Troubleshoot and analyze logs from AWS CloudWatch Logs - AWS's centralized logging service for applications and infrastructure.
Skill files:
scripts/analyze_errors.sh
scripts/find_recent_errors.sh
scripts/insights_queries.json
scripts/lib/normalize_errors.sh
scripts/lib/query_helpers.sh
scripts/list_log_groups.sh
scripts/list_log_streams.sh
scripts/noise-patterns.txt
scripts/run_insights_query.sh
scripts/tail_logs.sh
scripts/trace_request.sh
scripts/view_state.sh
Prerequisites:
aws CLI is installed and configured
Installation: Use the centralized tool installation scripts:
# Check and install AWS CLI (required)
../../../tools/aws-cli/install.sh
# Check and install jq (optional, recommended)
../../../tools/jq/install.sh
Always follow this pattern:
Use CloudWatch Logs Insights for all error analysis - it supports case-insensitive regex, which is essential because logs may contain "error", "Error", or "ERROR" in different formats.
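To see why case sensitivity matters, the sketch below replays the situation locally with grep on a few made-up log lines. The sample lines are illustrative and not taken from a real log group. The same idea carries over to an Insights clause such as filter @message like /(?i)error/.

```shell
#!/usr/bin/env bash
# Hypothetical sample lines with mixed-case severity markers.
sample_logs() {
  printf '%s\n' \
    '2026-02-06T15:30:45Z ERROR Connection timeout to database' \
    '2026-02-06T15:31:02Z Error: authentication failed' \
    '2026-02-06T15:31:10Z request completed with error code 500' \
    '2026-02-06T15:31:15Z INFO healthcheck ok'
}

# Case-sensitive match: only the upper-case variant is counted.
sensitive_count=$(sample_logs | grep -c 'ERROR')

# Case-insensitive match: all three error variants are counted.
insensitive_count=$(sample_logs | grep -ci 'error')

echo "case-sensitive: $sensitive_count, case-insensitive: $insensitive_count"
# prints: case-sensitive: 1, case-insensitive: 3
```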
All scripts output JSON by default to make results easy to parse programmatically by AI assistants and automation tools.
⚠️ RECOMMENDATION: Use full JSON output for typical error analysis
For most use cases, direct JSON output is more efficient than state management:
Only use --save-state (via state management scripts) if:
Recommended approach - get all data in one call:
# Run analysis and capture full JSON output
OUTPUT=$(./scripts/analyze_errors.sh <log-group> 24)
# Parse specific fields as needed
echo "$OUTPUT" | jq '.total_errors'
echo "$OUTPUT" | jq '.by_severity'
echo "$OUTPUT" | jq '.top_errors[:5]'
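Because progress messages go to stderr, the command substitution above captures only the JSON. A small guard can catch a failed or empty run before any parsing happens. The following is a minimal sketch: capture_json and fake_analyze are hypothetical names introduced here, and fake_analyze only stands in for ./scripts/analyze_errors.sh so the example runs without AWS access.

```shell
#!/usr/bin/env bash
# capture_json CMD [ARGS...]
# Runs the command once, keeps its stdout, and fails early if the command
# exits non-zero or prints something that does not start like a JSON object.
capture_json() {
  local out
  if ! out=$("$@"); then
    echo "command failed: $*" >&2
    return 1
  fi
  case "$out" in
    "{"*) printf '%s\n' "$out" ;;
    *) echo "unexpected non-JSON output from: $*" >&2; return 1 ;;
  esac
}

# Stand-in for ./scripts/analyze_errors.sh (hypothetical, offline only).
fake_analyze() { echo '{"total_errors": "1247"}'; }

if OUTPUT=$(capture_json fake_analyze /aws/app/myapp 24); then
  echo "analysis captured, safe to parse"
else
  echo "analysis failed, nothing to parse" >&2
fi
```

In real use, fake_analyze would be replaced by the path to analyze_errors.sh. The leading-brace check is only a cheap sanity test, not JSON validation.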
# Default: JSON output to stdout, progress to stderr
./scripts/analyze_errors.sh /aws/app/myapp 24
# Output:
{
"log_group": "/aws/app/myapp",
"hours_analyzed": 24,
"total_errors": "1247",
"by_severity": {
"critical": 15,
"error": 1200,
"warning": 25,
"failed": 7
},
"top_errors": [
{
"message": "Connection timeout to database",
"count": 342,
"percentage": 27.43,
"pattern": "Connection timeout to database"
},
...
],
"critical_errors": [...],
"top_errors_by_pattern": [
{
"pattern": "Error at <TIMESTAMP>",
"total_count": 450,
"occurrences": 12,
"examples": [
{"message": "Error at 2026-02-06 15:30:45", "count": 120},
{"message": "Error at 2026-02-06 16:45:12", "count": 95}
]
},
...
],
"hourly_distribution": [...],
"comparison": null // or populated if --compare-previous is used
}
With additional flags:
# Exclude noise patterns and compare with previous period
./scripts/analyze_errors.sh /aws/app/myapp 24 --exclude-noise --compare-previous
# Output includes comparison data:
{
...
"comparison": {
"current_period": {"total_errors": 1247, "hours": 24},
"previous_period": {"total_errors": 1050, "hours": 24},
"change": "+18.76%",
"trend": "increasing"
}
}
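The change and trend values in the sample above are consistent with a plain percentage change between the two period totals. The sketch below reproduces them. It is an illustration of the arithmetic, not the script's actual implementation, and the "decreasing" and "stable" labels are assumptions since only "increasing" appears in the sample output.

```shell
#!/usr/bin/env bash
# percent_change CURRENT PREVIOUS
# Prints the signed percentage change with two decimals.
percent_change() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev == 0) { print "n/a"; exit }   # avoid division by zero
    printf "%+.2f%%\n", (cur - prev) / prev * 100
  }'
}

# trend CURRENT PREVIOUS
# Labels the direction of the change (labels other than "increasing" are assumed).
trend() {
  if   [ "$1" -gt "$2" ]; then echo "increasing"
  elif [ "$1" -lt "$2" ]; then echo "decreasing"
  else echo "stable"
  fi
}

percent_change 1247 1050   # prints +18.76%
trend 1247 1050            # prints increasing
```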
Benefits:
Add --human flag for human-readable table/text format:
./scripts/analyze_errors.sh /aws/app/myapp 24 --human
# Output:
=== Error Analysis Results ===
Log Group: /aws/app/myapp
Total Errors: 1247
Top Errors by Frequency:
342x: Connection timeout to database
125x: Authentication failed...
...
In the model context:
# Extract specific field
./scripts/analyze_errors.sh /aws/app/myapp 24 | jq '.total_errors'
# Count distinct error types
./scripts/analyze_errors.sh /aws/app/myapp 24 | jq '.top_errors | length'
# Get top 3 errors
./scripts/analyze_errors.sh /aws/app/myapp 24 | jq '.top_errors[:3]'
⚠️ IMPORTANT: State management is for advanced use cases only
The state management system (~/.aws-log-analyzer/state/) is available for very large datasets or multi-step workflows, but is NOT recommended for typical error analysis.
Shared Library: This skill uses claudio-plugin/tools/memory/scripts/state.sh - a shared state management library used across multiple skills.
Why direct JSON output is better:
Note: analyze_errors.sh no longer supports the --save-state flag. Use direct JSON output instead (see examples above).
If you need state management for very large datasets (100K+ entries), you can manually save/load data:
# Capture output and save manually if needed
OUTPUT=$(./scripts/analyze_errors.sh <log-group> 24)
echo "$OUTPUT" > /tmp/analysis_result.json
# Later, load and parse
jq '.top_errors[:10]' /tmp/analysis_result.json
View saved state:
# List all saved states
./scripts/view_state.sh
# View specific state by ID
./scripts/view_state.sh analyze_errors_1707224567
# View latest state for an operation
./scripts/view_state.sh analyze_errors
Use --save-state when:
Don't use --save-state when:
Once data is saved, you can extract specific information using jq:
# Get the total error count
./scripts/view_state.sh analyze_errors | jq '.total_errors'
# Get top 5 errors with their counts
./scripts/view_state.sh analyze_errors | jq '.top_errors[0:5][] | {message: .message, count: .count}'
# Get only critical errors (count > 10)
./scripts/view_state.sh analyze_errors | jq '.critical_errors[] | select(.count > 10)'
# Get severity breakdown
./scripts/view_state.sh analyze_errors | jq '.by_severity'
# Get errors from a specific time bucket
./scripts/view_state.sh analyze_errors | jq '.hourly_distribution[] | select(.time_bucket | contains("2026-02-06T15"))'
# Extract error patterns (grouped by similarity)
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern[0:5]'
# Get all errors matching a specific pattern
./scripts/view_state.sh analyze_errors | jq '.top_errors[] | select(.pattern | contains("Connection"))'
# Get percentage of errors that are critical
./scripts/view_state.sh analyze_errors | jq '(.by_severity.critical / (.total_errors | tonumber) * 100)'
# Compare current vs previous period (if --compare-previous was used)
./scripts/view_state.sh analyze_errors | jq '.comparison'
# Get examples of a specific error pattern
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern[0].examples'
Example workflow using saved state:
# Step 1: Analyze errors with state saving
./scripts/analyze_errors.sh /aws/app/myapp 24 --save-state --exclude-noise
# Output:
# {
# "operation": "analyze_errors",
# "state_saved": true,
# "state_id": "analyze_errors_1738858234",
# "summary": {
# "log_group": "/aws/app/myapp",
# "total_errors": 1247,
# "top_error_patterns": [...]
# }
# }
# Step 2: Query specific details without re-running analysis
./scripts/view_state.sh analyze_errors | jq '.by_severity'
# Output: {"critical": 15, "error": 1200, "warning": 25, "failed": 7}
# Step 3: Get top 3 error patterns
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern[0:3]'
# Step 4: Find all errors with high frequency (> 50 occurrences)
./scripts/view_state.sh analyze_errors | jq '.top_errors[] | select(.count > 50)'
All operations are performed through the following scripts:
view_state.sh - View saved script outputs (for advanced workflows only)
  Note: analyze_errors.sh no longer supports the --save-state flag
list_log_groups.sh - List available log groups
list_log_streams.sh - List log streams within a group
analyze_errors.sh - Complete error analysis (recommended for most cases)
  Flags: --human, --exclude-noise, --compare-previous
find_recent_errors.sh - Quick search for recent errors
run_insights_query.sh - Execute custom CloudWatch Logs Insights queries
trace_request.sh - Trace a request ID across multiple log groups
tail_logs.sh - Monitor logs in real-time
Pre-built CloudWatch Logs Insights queries are available in scripts/insights_queries.json:
Error Analysis:
error_analysis.total_count - Count total errors
error_analysis.by_message - Group errors by message
error_analysis.unique_errors - Find unique errors (excludes noise)
error_analysis.hourly_distribution - Hourly error distribution
error_analysis.recent_errors - Last 100 errors
Performance Analysis:
performance_analysis.slow_requests - Requests slower than 1s
performance_analysis.latency_percentiles - P50, P90, P99 latencies
performance_analysis.requests_per_minute - Request rate
Request Tracing:
request_tracing.by_request_id - Trace by request ID
request_tracing.by_user - Trace by user ID
Application Monitoring:
application_monitoring.status_codes - HTTP status code distribution
application_monitoring.error_rate - Error rate percentage
application_monitoring.top_endpoints - Most accessed endpoints
User Request: "Analyze errors for <log-group-name> in the last 24 hours"
Execution Sequence:
# Step 1: Run complete error analysis
./scripts/analyze_errors.sh <log-group-name> 24
Output Provides:
Recommended approach - capture output and parse as needed:
# Step 1: Run analysis and capture full JSON output
OUTPUT=$(./scripts/analyze_errors.sh <log-group-name> 24)
# Step 2: Extract specific fields
echo "$OUTPUT" | jq '.total_errors'
# Output: "1247"
echo "$OUTPUT" | jq '.by_severity'
# Output: {"critical": 15, "error": 1200, "warning": 25, "failed": 7}
echo "$OUTPUT" | jq '.top_errors[:3]'
# Output: Array of top 3 errors with counts and percentages
# Step 3: Get more details on specific errors if needed
./scripts/find_recent_errors.sh <log-group-name> 1 50
Why this is efficient:
User Request: "Check for errors in my application"
Execution Sequence:
# Step 1: Find the log group
./scripts/list_log_groups.sh /aws/application
# Step 2: Analyze errors in the identified log group
./scripts/analyze_errors.sh <log-group-name> 24
User Request: "Trace request ID abc-123 through all services"
Execution Sequence:
# Single command to search all log groups with a common prefix
./scripts/trace_request.sh abc-123 /aws/myapp 24
Output: Shows all log entries containing the request ID, sorted by timestamp, across all log groups.
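How trace_request.sh merges results internally is not shown here. As a rough illustration of "sorted by timestamp, across all log groups", the sketch below merges hypothetical per-service hits into one timeline. The log group names, messages, and the tab-separated layout are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical per-service results, one "timestamp<TAB>log group<TAB>message"
# line per entry. ISO-8601 UTC timestamps sort correctly as plain text.
api_hits=$(printf '%s\t%s\t%s\n' \
  '2026-02-06T15:30:45Z' '/aws/myapp/api' 'abc-123 received POST /orders' \
  '2026-02-06T15:30:47Z' '/aws/myapp/api' 'abc-123 responded 500')
db_hits=$(printf '%s\t%s\t%s\n' \
  '2026-02-06T15:30:46Z' '/aws/myapp/db-proxy' 'abc-123 Connection timeout to database')

# Merge everything and order by the first (timestamp) column.
tab=$(printf '\t')
timeline=$(printf '%s\n%s\n' "$api_hits" "$db_hits" | LC_ALL=C sort -t "$tab" -k1,1)
printf '%s\n' "$timeline"
```

Reading the merged timeline top to bottom shows the request entering the API, failing in the database proxy, and the API returning the error.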
User Request: "Watch for OutOfMemory errors in real-time"
Execution Sequence:
# Tail logs with filter pattern
./scripts/tail_logs.sh <log-group-name> "OutOfMemoryError" 1h
Time formats: 1h, 30m, 2d, 5s
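When a relative time needs to be validated or compared before it is passed to tail_logs.sh, converting it to seconds is straightforward. The helper below is a minimal sketch: to_seconds is a hypothetical name and is not part of the skill's scripts.

```shell
#!/usr/bin/env bash
# to_seconds VALUE
# Converts a relative time such as 5s, 30m, 1h or 2d into seconds.
to_seconds() {
  local value=$1
  local number=${value%[smhd]}    # strip the trailing unit letter
  local unit=${value#"$number"}   # whatever remains is the unit
  case "$number" in
    ''|*[!0-9]*) echo "invalid time value: $value" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$number" ;;
    m) echo $((number * 60)) ;;
    h) echo $((number * 3600)) ;;
    d) echo $((number * 86400)) ;;
    *) echo "invalid time unit in: $value" >&2; return 1 ;;
  esac
}

to_seconds 1h    # prints 3600
to_seconds 30m   # prints 1800
to_seconds 2d    # prints 172800
to_seconds 5s    # prints 5
```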
User Request: "Find all authentication failures in the last 6 hours"
Execution Sequence:
# Step 1: Run custom Insights query
./scripts/run_insights_query.sh <log-group-name> 6 \
'fields @timestamp, @message | filter @message like /(?i)(auth|authentication)/ and @message like /(?i)(fail|denied)/ | sort @timestamp desc | limit 100'
Alternative using template query:
# Step 1: Load query from template (if you have one defined)
QUERY=$(jq -r '.custom_queries.auth_failures' scripts/insights_queries.json)
# Step 2: Run the query
./scripts/run_insights_query.sh <log-group-name> 6 "$QUERY"
User Request: "Find slow database queries in the last 24 hours"
Execution Sequence:
# Step 1: Use performance template query
QUERY=$(jq -r '.performance_analysis.slow_requests' scripts/insights_queries.json)
# Step 2: Run the query
./scripts/run_insights_query.sh <log-group-name> 24 "$QUERY"
For RDS slow query logs:
# Custom query for RDS slow query format
./scripts/run_insights_query.sh /aws/rds/instance/mydb/slowquery 24 \
'fields @timestamp, query_time, lock_time, rows_examined, @message | parse @message /Query_time: (?<qt>[0-9.]+)\s+Lock_time: (?<lt>[0-9.]+).*\n(?<query>.*)/ | filter qt > 1.0 | sort qt desc | limit 20'
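For readers who want to see what a wrapper like run_insights_query.sh might do underneath, here is a sketch built on the AWS CLI's aws logs start-query and aws logs get-query-results commands. The function names time_window and insights_query are hypothetical, the one-second polling interval is an arbitrary choice, and the actual script may work differently.

```shell
#!/usr/bin/env bash
# time_window HOURS
# Prints "START END" as epoch seconds covering the last HOURS hours.
time_window() {
  local end start
  end=$(date +%s)
  start=$((end - $1 * 3600))
  echo "$start $end"
}

# insights_query LOG_GROUP HOURS QUERY
# Starts an Insights query, polls until it leaves the Scheduled/Running
# states, then prints the raw result JSON. Requires AWS credentials.
insights_query() {
  local log_group=$1 hours=$2 query=$3 window start end query_id status
  window=$(time_window "$hours")
  start=${window% *}
  end=${window#* }
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" --end-time "$end" \
    --query-string "$query" \
    --query 'queryId' --output text) || return 1
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;   # Complete, Failed, Cancelled, Timeout, ...
    esac
  done
  aws logs get-query-results --query-id "$query_id" --output json
}
```

A call would look like insights_query /aws/app/myapp 6 'fields @timestamp, @message | limit 20'. The final JSON carries a status field that should be checked before trusting the results.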
User Request: "What's happening in my application right now?"
Execution Sequence:
# Step 1: List recent log streams to see activity
./scripts/list_log_streams.sh <log-group-name> 10
# Step 2: Tail recent logs
./scripts/tail_logs.sh <log-group-name> "" 10m
# Step 3: If errors are seen, analyze them
./scripts/analyze_errors.sh <log-group-name> 1
User Request: "Has the error rate increased in the last hour?"
Execution Sequence:
# Step 1: Get errors from last hour
./scripts/analyze_errors.sh <log-group-name> 1
# Step 2: Get errors from previous hour for comparison
./scripts/run_insights_query.sh <log-group-name> 2 \
'fields @timestamp | filter @message like /(?i)(error|fail|exception|critical)/ | stats count() as error_count by bin(1h)'
When combining this skill with other skills (especially gitlab-job-analyzer):
See the complete optimization guide in the main CLAUDE.md documentation under "Performance Optimization for Cross-Skill Analysis".
Key optimizations:
Example - Optimized cross-skill analysis:
# SINGLE MESSAGE - Parallel execution:
# Tool 1: AWS log analysis for component 1
./aws-log-analyzer/scripts/analyze_errors.sh /aws/app/component1 24
# Tool 2: AWS log analysis for component 2 (runs in parallel)
./aws-log-analyzer/scripts/analyze_errors.sh /aws/app/component2 24
# Tool 3: GitLab analysis (runs in parallel)
./gitlab-job-analyzer/scripts/analyze_recent_jobs.sh owner/repo --hours 24
# Then parse the captured AWS output multiple ways without re-running
# (assumes the JSON from one of the AWS tool calls is stored in AWS_OUTPUT):
echo "$AWS_OUTPUT" | jq '.total_errors'
echo "$AWS_OUTPUT" | jq '.by_severity'
echo "$AWS_OUTPUT" | jq '.top_errors[:5]'
echo "$AWS_OUTPUT" | jq '.hourly_distribution'
Expected performance:
Logs may contain "error", "Error", or "ERROR" - always use case-insensitive regex in CloudWatch Logs Insights queries:
Use: /(?i)error/ in Insights queries
Avoid: "ERROR" filter patterns (case-sensitive)
For most error investigations, use analyze_errors.sh first:
Leverage scripts/insights_queries.json for common analysis patterns:
Noise patterns are defined in scripts/noise-patterns.txt:
file already closed
SlowDown, ThrottlingException, TooManyRequestsException
RequestLimitExceeded, Throttled, RequestThrottled
ProvisionedThroughputExceededException
Usage:
# Enable noise filtering (uses patterns from noise-patterns.txt)
./scripts/analyze_errors.sh <log-group> 24 --exclude-noise
# View all noise patterns
cat scripts/noise-patterns.txt
# Add custom patterns (edit the file)
echo "MyCustomNoisePattern" >> scripts/noise-patterns.txt
Note: Patterns are applied as case-insensitive regex patterns in CloudWatch Logs Insights queries.
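One way to picture how a pattern file becomes a case-insensitive filter is to join its lines into a single alternation. The sketch below assumes one pattern per line and additionally skips blank lines and lines starting with #. Whether the real file supports comments is an assumption, and build_noise_regex is a hypothetical helper, not the skill's own code.

```shell
#!/usr/bin/env bash
# build_noise_regex FILE
# Joins the non-empty, non-comment lines of FILE into one alternation.
build_noise_regex() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -sd '|' -
}

# Throwaway pattern file so the sketch does not touch the real one.
patterns_file=$(mktemp)
printf '%s\n' 'file already closed' 'ThrottlingException' 'SlowDown' > "$patterns_file"

noise_regex=$(build_noise_regex "$patterns_file")

# Clause that could be appended to an Insights query to drop noise.
insights_filter="filter @message not like /(?i)(${noise_regex})/"
echo "$insights_filter"

# The same regex applied locally: -E extended regex, -i ignore case, -v invert.
kept=$(printf '%s\n' \
  'ERROR throttlingexception: rate exceeded' \
  'ERROR Connection timeout to database' | grep -Eiv "$noise_regex")
echo "$kept"   # prints: ERROR Connection timeout to database

rm -f "$patterns_file"
```

Because the lines are used as regex fragments, a custom pattern containing characters such as ., ( or | needs escaping if it is meant literally.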
Time range guidelines:
Narrower time ranges:
Error messages often differ only in timestamps, IPs, or IDs:
Pattern normalization groups similar errors automatically:
# The output includes both individual errors and pattern-grouped errors
./scripts/analyze_errors.sh <log-group> 24
# Query pattern-grouped errors from saved state
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern'
Pattern output structure:
{
"pattern": "Error at <TIMESTAMP>",
"total_count": 450,
"occurrences": 12,
"examples": [
{"message": "Error at 2026-02-06 15:30:45", "count": 120},
{"message": "Error at 2026-02-06 16:45:12", "count": 95}
]
}
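The normalization rules themselves live in scripts/lib/normalize_errors.sh and are not reproduced here. As a rough idea of the technique, the sketch below replaces volatile tokens with placeholders using sed and then counts messages per pattern. Only the <TIMESTAMP> placeholder appears in the output above; <IP> and <UUID> and the exact regular expressions are assumptions.

```shell
#!/usr/bin/env bash
# normalize_message
# Reads messages on stdin and replaces volatile tokens with placeholders.
normalize_message() {
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?Z?/<TIMESTAMP>/g' \
    -e 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/<IP>/g' \
    -e 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/<UUID>/g'
}

# Group normalized messages and count how many raw messages share a pattern.
printf '%s\n' \
  'Error at 2026-02-06 15:30:45' \
  'Error at 2026-02-06 16:45:12' \
  'Refused connection from 10.0.0.7' \
  | normalize_message | sort | uniq -c | sort -rn
```

The two timestamped messages collapse into a single "Error at <TIMESTAMP>" pattern with a count of 2, which is the grouping that top_errors_by_pattern reports.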
Benefits:
For microservices/distributed architectures:
Use trace_request.sh to follow requests across services
Cause: Case-sensitive filter patterns don't match logs
Solution:
# ✅ Use Insights-based scripts (case-insensitive)
./scripts/analyze_errors.sh <log-group-name> 24
./scripts/find_recent_errors.sh <log-group-name> 1
Cause: Broad search across large time range
Solution:
# Step 1: Narrow time range
./scripts/analyze_errors.sh <log-group-name> 1 # Last hour instead of 24
# Step 2: Filter by specific pattern
./scripts/run_insights_query.sh <log-group-name> 1 \
'fields @timestamp, @message | filter @message like /(?i)OutOfMemory/ | limit 50'
Cause: Query too complex or time range too large
Solution:
# Step 1: Reduce time range
./scripts/analyze_errors.sh <log-group-name> 1 # Instead of 24
# Step 2: Simplify query (remove complex parsing)
# Step 3: Add more specific filters early in the query
./scripts/run_insights_query.sh <log-group-name> 1 \
'fields @timestamp, @message | filter @message like /(?i)specific_error/ | stats count()'
Cause: Typo or wrong region
Solution:
# Step 1: List all log groups to verify name
./scripts/list_log_groups.sh
# Step 2: Search with prefix
./scripts/list_log_groups.sh /aws/application
# Step 3: Verify AWS region is correct (check AWS CLI config)
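For the region check in Step 3, it helps to print every place a region can come from, since an environment variable can silently differ from the config file. The helper below is a small sketch: show_region_settings is a hypothetical name, and it only reads settings without changing them.

```shell
#!/usr/bin/env bash
# show_region_settings
# Prints each place a region may come from so a mismatch is easy to spot.
show_region_settings() {
  echo "AWS_REGION=${AWS_REGION:-<unset>}"
  echo "AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-<unset>}"
  echo "AWS_PROFILE=${AWS_PROFILE:-<unset>}"
  if command -v aws >/dev/null 2>&1; then
    # Region stored in the config file for the active profile, if any.
    echo "configured region=$(aws configure get region 2>/dev/null || true)"
  else
    echo "aws CLI not found on PATH"
  fi
}

show_region_settings
```

To look in a specific region without changing configuration, the AWS CLI's global --region option can be used directly, for example aws logs describe-log-groups --region us-east-1 --log-group-name-prefix /aws/application. Whether the skill's scripts accept a region argument is not stated here.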
Workflow: Correlate K8s pod events with application logs
# Step 1: Get pod name from kubernetes skill
# (kubernetes skill command)
# Step 2: Search CloudWatch logs for that pod
./scripts/trace_request.sh <pod-name> /aws/application 24
Workflow: Investigate deployment-related errors
# Step 1: Get commit SHA from gitlab skill
# (gitlab skill command)
# Step 2: Search logs for that deployment
./scripts/trace_request.sh <commit-sha> /aws/application 24
# Step 3: Analyze errors during deployment window
./scripts/analyze_errors.sh /aws/application/myapp 1
AWS Services:
/aws/lambda/<function-name>
/aws/rds/instance/<instance-id>/*
/aws/ecs/containerinsights/<cluster>
/aws/eks/<cluster>/cluster
/aws/apigateway/<api-id>/<stage>
Application Logs:
/aws/application/<app-name>
/var/log/messages
/aws/containerinsights/<cluster>/*
All commands output JSON by default. Add --human for human-readable format.
# Analyze errors (JSON output) - RECOMMENDED
OUTPUT=$(./scripts/analyze_errors.sh <log-group> 24)
echo "$OUTPUT" | jq '.total_errors'
# Analyze errors (human-readable)
./scripts/analyze_errors.sh <log-group> 24 --human
# Analyze errors with noise filtering and comparison
./scripts/analyze_errors.sh <log-group> 24 --exclude-noise --compare-previous
# Find recent errors
./scripts/find_recent_errors.sh <log-group> 1
# Trace a request
./scripts/trace_request.sh <request-id> <log-group-prefix> 24
# Monitor in real-time
./scripts/tail_logs.sh <log-group>
# List log groups
./scripts/list_log_groups.sh
# Custom query
./scripts/run_insights_query.sh <log-group> 24 '<insights-query>'
# Parse JSON output
./scripts/analyze_errors.sh <log-group> 24 | jq '.total_errors'
./scripts/list_log_groups.sh | jq '.log_groups[].name'
For most scripts (hours):
1 = last 1 hour
24 = last 24 hours
168 = last 7 days
For tail_logs.sh (relative time):
1h = last hour
30m = last 30 minutes
2d = last 2 days
Required:
aws CLI v2 (recommended) or v1
  Install: ../../../tools/aws-cli/install.sh
  Check: ../../../tools/aws-cli/install.sh --check
Optional (recommended):
jq - JSON processor for parsing outputs
  Install: ../../../tools/jq/install.sh
  Check: ../../../tools/jq/install.sh --check
All installation scripts:
Plugin installation: npx claudepluginhub aipcc-cicd/claudio-skills --plugin claudio-plugin