Analyse raw data quality using the ABS Data Quality Framework (7 dimensions). Use when the user asks to assess data quality, review a dataset, check data fitness for purpose, produce a data quality report, or evaluate data before analysis. Generates a structured markdown report covering institutional environment, relevance, timeliness, accuracy, coherence, interpretability, and accessibility.
How this skill is triggered — by the user, by Claude, or both
Slash command
/data-quality-analysis:data-quality-analysisThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Analyse raw data against the Australian Bureau of Statistics (ABS) Data Quality Framework and produce a structured quality report. Based on ABS Catalogue No. 1520.0 — ABS Data Quality Framework, May 2009.
Analyse raw data against the Australian Bureau of Statistics (ABS) Data Quality Framework and produce a structured quality report. Based on ABS Catalogue No. 1520.0 — ABS Data Quality Framework, May 2009.
This skill assesses data quality across the seven dimensions of the ABS Data Quality Framework:
The output is a standalone markdown report that can be used directly or populated into a formal quality declaration document.
Before analysing any dataset, you MUST conduct a discovery interview to understand the data and its context.
Data Source & Ownership
Purpose & Intended Use
Data Context
Known Issues
Quality Expectations
If the user provides a data file without context, proceed with what can be determined from the data itself and clearly note assumptions and unknowns in the report. Mark dimensions that cannot be assessed without external context as "Unable to assess — requires additional context" and list the specific information needed.
Load and inspect the data to establish a foundational understanding:
After initial profiling, compute a per-column quality scorecard. This quantifies six quality metrics for every column, producing a single table that gives the user an immediate, measurable view of data quality before the narrative dimension assessments.
| Metric | Definition | How to Calculate | Scale |
|---|---|---|---|
| Completeness | Proportion of non-null, non-blank values | (total - nulls - blanks) / total * 100 | 0–100% |
| Validity | Proportion of values that conform to the expected domain (type, range, format) | valid_values / non_null_values * 100 | 0–100% |
| Consistency | Proportion of values that match the dominant format/pattern within the column | values_matching_dominant_pattern / non_null_values * 100 | 0–100% |
| Uniqueness | Proportion of distinct values relative to total non-null values | unique_values / non_null_values * 100 | 0–100% |
| Timeliness | For date/time columns: recency score based on how current the most recent value is. For non-date columns: mark as N/A | See timeliness calculation below | 0–100% or N/A |
| Accuracy | Proportion of values that are plausible (not impossible, not extreme outliers) | (non_null - impossible - outliers) / non_null * 100 | 0–100% |
Completeness:
completeness = (len(df) - df[col].isna().sum() - (df[col] == '').sum()) / len(df) * 100
NaN/None AND empty strings/whitespace-only values as incompleteValidity: Assess whether values conform to the expected domain for the column type:
WO-YYYY-NNNNN)# Example for a numeric column that should be positive
valid = df[col].dropna().between(0, upper_bound).sum()
validity = valid / df[col].notna().sum() * 100
Consistency: Detect the dominant format/pattern and measure adherence:
# Example: check case consistency for a categorical column
values = df[col].dropna()
dominant_case = values.apply(str.title) # or str.lower, str.upper
consistency = (values == dominant_case).sum() / len(values) * 100
Uniqueness:
uniqueness = df[col].nunique() / df[col].notna().sum() * 100
Timeliness: Only applicable to date/datetime columns. Calculate as a recency score:
# For date columns: how recent is the most recent value?
if col is date type:
max_date = df[col].max()
days_since = (assessment_date - max_date).days
# Score: 100% if within 30 days, decaying linearly to 0% at 365+ days
timeliness = max(0, 100 - (days_since / 365 * 100))
else:
timeliness = "N/A"
Accuracy: Combine impossible value detection and outlier detection:
non_null = df[col].notna().sum()
impossible = count_impossible_values(df[col]) # domain-specific rules
outliers = count_outliers_iqr(df[col]) # IQR method for numeric
accuracy = (non_null - impossible - outliers) / non_null * 100
Impossible value rules (apply based on column semantics):
Present the results in this format:
| Column | Completeness | Validity | Consistency | Uniqueness | Timeliness | Accuracy | Issues |
|---|---|---|---|---|---|---|---|
| column_name | 98.5% | 99.2% | 100.0% | 45.3% | N/A | 97.8% | 23 outliers |
| Metric | HIGH (Green) | ADEQUATE (Amber) | LOW (Red) |
|---|---|---|---|
| Completeness | >= 95% | 80–94% | < 80% |
| Validity | >= 98% | 90–97% | < 90% |
| Consistency | >= 95% | 80–94% | < 80% |
| Uniqueness | Context-dependent | Context-dependent | Context-dependent |
| Timeliness | >= 80% | 50–79% | < 50% |
| Accuracy | >= 95% | 85–94% | < 85% |
Note on Uniqueness: Uniqueness thresholds depend on the column's role:
After computing all six metrics per column, derive a summary score for each column:
Column Score = weighted average of applicable metrics
Default weights (adjustable based on user priorities):
Present an overall dataset quality score as the average of all column scores.
import pandas as pd
import numpy as np
def column_quality_scorecard(df, date_cols=None, id_cols=None, assessment_date=None):
"""Compute per-column quality metrics for a DataFrame."""
if assessment_date is None:
assessment_date = pd.Timestamp.now()
if date_cols is None:
date_cols = []
if id_cols is None:
id_cols = []
results = []
for col in df.columns:
n = len(df)
non_null = df[col].notna().sum()
nulls = df[col].isna().sum()
# Completeness
blanks = 0
if df[col].dtype == 'object':
blanks = df[col].fillna('').apply(lambda x: str(x).strip() == '').sum() - nulls
blanks = max(blanks, 0)
completeness = (n - nulls - blanks) / n * 100 if n > 0 else 0
# Uniqueness
uniqueness = df[col].nunique() / non_null * 100 if non_null > 0 else 0
# Validity, Consistency, Accuracy — compute per data type
validity = np.nan
consistency = np.nan
accuracy = np.nan
timeliness = "N/A"
issues = []
if pd.api.types.is_numeric_dtype(df[col]):
vals = df[col].dropna()
# Validity: check for negative values in columns that should be positive
# (apply domain-specific rules as needed)
validity = 100.0 # default; override with domain rules
# Consistency: check precision consistency
if len(vals) > 0:
decimals = vals.apply(lambda x: len(str(x).split('.')[-1]) if '.' in str(x) else 0)
dominant = decimals.mode().iloc[0] if len(decimals.mode()) > 0 else 0
consistency = (decimals == dominant).sum() / len(vals) * 100
# Accuracy: IQR outlier detection
if len(vals) > 0:
q1, q3 = vals.quantile(0.25), vals.quantile(0.75)
iqr = q3 - q1
outlier_count = ((vals < q1 - 1.5 * iqr) | (vals > q3 + 1.5 * iqr)).sum()
negative_count = (vals < 0).sum()
accuracy = (len(vals) - outlier_count) / len(vals) * 100
if outlier_count > 0:
issues.append(f"{outlier_count} outliers")
if negative_count > 0:
issues.append(f"{negative_count} negatives")
elif col in date_cols or pd.api.types.is_datetime64_any_dtype(df[col]):
vals = pd.to_datetime(df[col], errors='coerce').dropna()
valid_dates = len(vals)
attempted = df[col].notna().sum()
validity = valid_dates / attempted * 100 if attempted > 0 else 0
consistency = 100.0 # after parsing; pre-parse format consistency requires string analysis
# Timeliness
if len(vals) > 0:
days_since = (assessment_date - vals.max()).days
timeliness = f"{max(0, 100 - (days_since / 365 * 100)):.0f}%"
# Accuracy: future dates, implausible past dates
future = (vals > assessment_date).sum()
accuracy = (len(vals) - future) / len(vals) * 100 if len(vals) > 0 else 0
if future > 0:
issues.append(f"{future} future dates")
elif df[col].dtype == 'object':
vals = df[col].dropna()
if len(vals) > 0:
# Validity: non-empty strings
valid = vals.apply(lambda x: str(x).strip() != '').sum()
validity = valid / len(vals) * 100
# Consistency: case consistency check
title_match = (vals == vals.str.title()).sum()
lower_match = (vals == vals.str.lower()).sum()
upper_match = (vals == vals.str.upper()).sum()
best_match = max(title_match, lower_match, upper_match)
consistency = best_match / len(vals) * 100
accuracy = 100.0 # default for strings; override with domain rules
# Identifier uniqueness check
if col in id_cols and uniqueness < 100:
issues.append(f"Duplicate IDs ({100 - uniqueness:.1f}%)")
if nulls > 0:
issues.append(f"{nulls} nulls ({nulls/n*100:.1f}%)")
results.append({
'Column': col,
'Completeness': f"{completeness:.1f}%",
'Validity': f"{validity:.1f}%" if not np.isnan(validity) else "—",
'Consistency': f"{consistency:.1f}%" if not np.isnan(consistency) else "—",
'Uniqueness': f"{uniqueness:.1f}%",
'Timeliness': timeliness,
'Accuracy': f"{accuracy:.1f}%" if not np.isnan(accuracy) else "—",
'Issues': "; ".join(issues) if issues else "—"
})
return pd.DataFrame(results)
Assess each of the seven ABS DQF dimensions systematically. For each dimension, provide:
The institutional and organisational factors that influence the credibility and trustworthiness of the data.
Assess the following (where information is available):
| Sub-element | What to Assess |
|---|---|
| Mandate for data collection | Is there a legal, regulatory, or organisational mandate for collecting this data? What legislation, policy, or contract governs the collection? |
| Adequacy of resources | Were sufficient resources (staff, systems, budget) available to collect and maintain the data to the required standard? |
| Quality commitment | Are there documented quality assurance processes? Is there evidence of data validation, review, or quality control during collection? |
| Statistical confidentiality | Are there privacy or confidentiality considerations? Has personally identifiable information (PII) been appropriately handled? Is the data de-identified where required? |
What to look for in the data itself:
Rating guidance:
How well the data meets the needs of users in terms of concepts measured and populations represented.
Assess the following:
| Sub-element | What to Assess |
|---|---|
| Scope and coverage | Does the data cover the target population? Who or what is included and excluded? Are exclusions likely to cause bias? |
| Concepts and classifications | Do the variables/fields measure the right concepts? Are classifications and categories appropriate for the intended use? |
| Reference period | Does the reference period align with the user's analytical needs? |
| User needs alignment | Does the data contain the fields necessary to answer the user's questions? Are there critical gaps? |
What to look for in the data itself:
Rating guidance:
The delay between the reference period and data availability, and the currency of the data.
Assess the following:
| Sub-element | What to Assess |
|---|---|
| Reference period | What period does the data relate to? |
| Collection/extraction date | When was the data actually collected or extracted? |
| Lag | What is the delay between the reference period and the data being made available? |
| Update frequency | How often is the data refreshed? Is this sufficient for the intended use? |
| Currency for intended use | Is the data current enough for the decisions it will inform? |
What to look for in the data itself:
Rating guidance:
How well the data correctly describes the phenomena it was designed to measure.
This is typically the most data-intensive dimension to assess. Evaluate:
| Sub-element | What to Assess |
|---|---|
| Sampling error | If the data is a sample, what is the sampling method? What is the margin of error? Is the sample size adequate? |
| Non-sampling error | Errors from collection, processing, or coverage — not related to sampling |
| Coverage error | Are units in the target population missing from or incorrectly included in the data? |
| Non-response error | What proportion of expected responses are missing? Is there a pattern to non-response (systematic bias)? |
| Response/measurement error | Are there signs of incorrect values from respondents or measurement instruments? |
| Processing error | Are there signs of errors introduced during data entry, coding, editing, or transformation? |
What to look for in the data itself:
Rating guidance:
The internal consistency of the data and its comparability with other sources over time.
Assess the following:
| Sub-element | What to Assess |
|---|---|
| Internal coherence | Are aggregations consistent? Do sub-totals sum to totals? Are cross-tabulations consistent? |
| Temporal coherence | Is the data consistent over time? Have definitions, classifications, or methods changed between periods? |
| Cross-source coherence | Does this data align with other related datasets or published statistics? |
| Standards alignment | Does the data use recognised standards, classifications, or coding schemes? (e.g., ANZSIC, ANZSCO, ISO country codes) |
What to look for in the data itself:
Rating guidance:
The availability of information needed to understand and correctly use the data.
Assess the following:
| Sub-element | What to Assess |
|---|---|
| Concepts and definitions | Are all variables clearly defined? Is there a data dictionary? |
| Classifications and coding | Are classification schemes documented? Are codes explained? |
| Collection methodology | Is the data collection method documented? (survey instrument, administrative form, sensor, etc.) |
| Data processing | Are transformations, derivations, and cleaning steps documented? |
| Known limitations | Are caveats, limitations, and appropriate use cases documented? |
What to look for in the data itself:
Rating guidance:
The ease with which users can find and access the data in a suitable format.
Assess the following:
| Sub-element | What to Assess |
|---|---|
| Discoverability | Can users easily find that this data exists? Is it catalogued or registered? |
| Format suitability | Is the data in a machine-readable, open format? (CSV, JSON, Parquet vs proprietary formats) |
| Access conditions | Are there restrictions on access? (authentication, licensing, cost, approvals) |
| Technical accessibility | Can the data be easily loaded into common tools? (file size, encoding, structure) |
| Documentation accessibility | Is supporting documentation co-located with the data? |
What to look for in the data itself:
Rating guidance:
After completing the assessment, generate a standalone HTML report using the template in references/report-template.html. The report uses SAS-AM branding with light/dark mode, colour-coded scorecard cells, a sidebar navigation, and a print-friendly layout.
The primary output is a single HTML file (data-quality-report.html) that:
The HTML report follows this layout:
Derive an overall quality rating based on the seven dimension ratings:
| Overall Rating | Criteria |
|---|---|
| HIGH | All dimensions rated HIGH or ADEQUATE, with no more than one ADEQUATE |
| ADEQUATE | Majority of dimensions rated ADEQUATE or above, no more than one LOW |
| LOW | Two or more dimensions rated LOW |
| CONDITIONAL | Quality is acceptable for some uses but not others — specify conditions |
Use these CSS classes for rating badges and scorecard cells:
| Rating | Badge Class | Cell Class |
|---|---|---|
| HIGH | rating rating-high | cell-high |
| ADEQUATE | rating rating-adequate | cell-adequate |
| LOW | rating rating-low | cell-low |
| UNABLE TO ASSESS | rating rating-unable | cell-na |
The overall score ring in the executive summary uses an SVG circle with stroke-dashoffset:
circumference = 2 * π * 42 = 264
offset = circumference * (1 - score / 100)
For example, a score of 96.4% → offset = 264 * (1 - 0.964) = 9.5
Set stroke colour based on rating: var(--rating-high) for HIGH, var(--rating-adequate) for ADEQUATE, etc.
Conduct the discovery interview to understand:
Load and profile the data:
Use Python (pandas, numpy) or the appropriate tool for the data format. Write analysis code in a script file or notebook for reproducibility.
Work through each of the seven dimensions systematically:
Generate the HTML report:
references/report-template.html{{PLACEHOLDER}} values with findingscell-high, cell-adequate, cell-low)264 * (1 - score / 100)data-quality-report.html (or a user-specified filename)The skill should handle any data format the user provides:
When writing profiling code, prefer these libraries:
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv("data.csv") # or appropriate reader
# Schema
print(df.dtypes)
print(df.shape)
# Descriptive statistics
print(df.describe(include='all'))
# Missing data
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'missing_count': missing, 'missing_pct': missing_pct}))
# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Value distributions for categorical columns
for col in df.select_dtypes(include='object').columns:
print(f"\n{col}:")
print(df[col].value_counts().head(10))
For datasets too large to load fully into memory:
pd.read_csv(..., chunksize=10000))Use Australian English spelling throughout:
Source: Australian Bureau of Statistics, ABS Data Quality Framework, Catalogue No. 1520.0, May 2009.
The ABS DQF was developed based on:
Key principle: Quality is fitness for purpose. The same dataset may be high quality for one use and low quality for another. Always assess quality relative to the intended use.
Trade-offs: There are inherent trade-offs between dimensions. For example:
Before delivering the report, verify:
npx claudepluginhub sas-asset-management/sasamclaudecodeskills --plugin data-quality-analysisReferences data quality dimensions with qsv checks and provides remediation decision tree for tabular CSV assessment and fixes.
Profiles tables or files (CSV, Excel, Parquet, JSON) to reveal shape, null rates, column distributions, top values, percentiles, data quality issues, and column categories.
Profiles unfamiliar datasets: schema structure, column distributions, null rates, cardinality, outliers, table relationships, and temporal coverage. Onboard new data sources, audit freshness, or discover foreign keys.