Data Analyst Guide
🎯 Agent Overview
Your comprehensive guide to SQL for data analysis. Learn how to extract insights from data through SQL queries, exploratory analysis, and business analytics, with focus on reporting and visualization preparation.
Perfect For: Data analysts, business analysts, analytics engineers
Your Success Path: Fundamentals → Data Exploration → Aggregation & Analysis → Statistical Functions → Business Metrics → Optimization
📊 Agent Expertise & Comprehensive Competencies (100+ hours content)
1. Data Foundation & Sources (Complete)
- Data Types: Numeric (int, float, decimal), String (varchar, text), Date/Time (date, timestamp), Boolean, JSON, UUID
- Data Sources: Databases (SQL, NoSQL), APIs, files (CSV, JSON, Parquet), data warehouses, data lakes, real-time streams
- Data Formats: Structured (CSV, databases), semi-structured (JSON, XML), unstructured (text, images, logs)
- Data Collection: Extraction methods, sampling strategies, batch vs real-time, API rate limiting
- Data Quality Assessment: Completeness, accuracy, consistency, timeliness evaluation
2. SQL for Analytics (Complete)
- SELECT Queries: Basic to advanced, projection, column aliases, expressions, CASE statements
- Filtering: WHERE clauses, complex conditions, NULL handling, BETWEEN, IN, LIKE patterns
- Data Types in Analytics: Type casting, format conversion, implicit vs explicit casting
- String Functions: CONCAT, SUBSTRING, UPPER, LOWER, LENGTH, TRIM, REPLACE, SPLIT, POSITION
- Math Functions: ROUND, CEIL, FLOOR, ABS, POWER, SQRT, MOD operations
- Aggregation Functions: COUNT (with DISTINCT), SUM, AVG, MIN, MAX, GROUP_CONCAT
- Window Functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE
3. Exploratory Data Analysis (EDA) (Deep-Dive)
- Data Profiling: Row counts, column statistics, data type verification, null analysis
- Distribution Analysis: Frequency distributions, skewness, kurtosis, outlier identification
- Central Tendency: Mean, median, mode, geometric mean, weighted averages
- Dispersion Measures: Range, variance, standard deviation, coefficient of variation, IQR
- Correlation & Covariance: Relationship analysis, correlation matrices, dependency detection
- Pattern Recognition: Trends, seasonality, cyclicity, anomalies, breakpoints
4. Data Cleaning & Preparation
- Handling Missing Data:
- Detection (NULL, empty strings, special values)
- Strategies (deletion, imputation, forward/backward fill)
- Context-aware approaches for different column types
- Duplicate Detection & Removal:
- Exact duplicates, near-duplicates, fuzzy matching
- Business rule duplicates vs technical duplicates
- Outlier Handling:
- IQR method, z-score, isolation forests
- Domain-based outlier detection
- Keep, cap, or remove decision frameworks
- Data Type Standardization: Consistent formats, encoding, normalization
- Regex-based Cleaning: Pattern matching, text extraction, validation
5. Statistical Analysis & Advanced Metrics
- Descriptive Statistics: Summary tables, frequency analysis, cumulative distributions
- Comparative Analysis: T-tests, ANOVA, chi-square, effect size calculation
- Cohort Analysis: Defining cohorts, tracking cohort progression, retention curves
- Segment Analysis: Customer segments, behavioral segments, RFM analysis
- Correlation Analysis: Pearson, Spearman, Kendall correlations
- Time-based Metrics: Running totals, moving averages, year-over-year growth, decay functions
6. Advanced Query Patterns
- Multi-level Aggregation: Nested aggregations, subqueries in GROUP BY
- Complex JOINs: Self-joins, outer joins for asymmetric analysis, anti-joins
- Recursive Queries: Hierarchical data analysis, path traversal
- Set Operations: UNION, UNION ALL, INTERSECT, EXCEPT for comparative analysis
- CTEs (Common Table Expressions): Temporary result sets, recursive CTEs, readability
- Window Partitioning: Partition-based calculations, cross-row comparisons
- Sampling Techniques: Systematic sampling, stratified sampling, random sampling at scale
7. Reporting & Visualization Preparation
- Report Query Design: Efficiency, reusability, parameter-driven queries
- Dimension Tables: Creating lookup tables for drill-down analysis
- Calculated Fields: Pre-aggregated metrics, business logic in SQL
- Format Conversion: For visualization tools (wide vs long format)
- Hierarchical Data: Preparing data for tree maps and hierarchical visualizations
- Color-coding Data: Percentile-based categorization, threshold-based flagging
- Performance-Optimized: Query caching, materialized views, query hints
8. Performance Optimization & Advanced Techniques
- Query Optimization:
- EXPLAIN ANALYZE interpretation
- Join order optimization
- Index utilization verification
- Subquery vs JOIN tradeoffs
- Materialized Views: Creation, refresh strategies, incremental updates
- Partitioning Strategies: Date-based, range-based, list-based partitioning
- Indexing for Analytics: Covering indexes, composite indexes, partial indexes
- Approximate Aggregations: HyperLogLog for distinct counts, sampling for speed
- Incremental Analysis: Only processing new/changed data
- Caching Strategies: Redis caching, result set caching, invalidation patterns
Comprehensive Competency Integration
EDA Workflow
Data Ingestion
↓
Profile & Understand (Data Foundation + EDA)
↓
Clean & Prepare (Data Cleaning)
↓
Exploratory Analysis (Advanced Queries)
↓
Statistical Analysis (Testing & Metrics)
↓
Reporting & Visualization (Prep for stakeholders)
↓
Optimize Performance (Advanced Techniques)
Learning Path
- Foundation - SQL fundamentals for analytics
- Data Exploration - EDA and data understanding
- Analysis Skills - Aggregations, joins, complex queries
- Advanced Analytics - Statistical functions, time series, advanced patterns
- Expert Reporting - Optimized reporting and insights
When to Invoke This Agent
- You need to explore and understand new datasets
- You're creating analytical reports from databases
- You need to validate data quality
- You want to find patterns and trends in data
- You're preparing data for visualization tools
- You need to calculate business metrics and KPIs
- You're troubleshooting data issues
Key Topics Covered
Data Exploration
- Descriptive statistics
- Distribution analysis
- Correlation and relationships
- Data profiling techniques
Analytical Queries
- Complex aggregations
- Multi-step analyses
- Comparative analysis
- Cohort and segment analysis
Data Preparation
- Cleaning and transformation
- Handling missing values
- Outlier detection and handling
- Data validation rules
Business Intelligence
- KPI calculations
- Trend and variance analysis
- Forecasting foundations
- Dashboard data preparation
Resources Included
- Exploratory analysis query templates
- Data cleaning procedures
- Statistical query examples
- Common KPI calculations
- Performance optimization tips
- Data validation checklists
- Reporting best practices
- Trend analysis techniques