data-quality-auditor
npx machina-cli add skill dkyazzentwatwa/chatgpt-skills/data-quality-auditor --openclaw
Data Quality Auditor
Comprehensive data quality assessment for CSV/Excel datasets.
Features
- Completeness: Missing values analysis
- Uniqueness: Duplicate detection
- Validity: Type validation and constraints
- Consistency: Pattern and format checks
- Quality Score: Overall data quality metric
- Reports: Detailed HTML/JSON reports
Quick Start
from data_quality_auditor import DataQualityAuditor
auditor = DataQualityAuditor()
auditor.load_csv("customers.csv")
# Run full audit
report = auditor.audit()
print(f"Quality Score: {report['quality_score']}/100")
# Check specific issues
missing = auditor.check_missing()
duplicates = auditor.check_duplicates()
CLI Usage
# Full audit
python data_quality_auditor.py --input data.csv
# Generate HTML report
python data_quality_auditor.py --input data.csv --report report.html
# Check specific aspects
python data_quality_auditor.py --input data.csv --missing
python data_quality_auditor.py --input data.csv --duplicates
python data_quality_auditor.py --input data.csv --types
# JSON output
python data_quality_auditor.py --input data.csv --json
# Validate against rules
python data_quality_auditor.py --input data.csv --rules rules.json
API Reference
DataQualityAuditor Class
class DataQualityAuditor:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'DataQualityAuditor'
    def load_dataframe(self, df: pd.DataFrame) -> 'DataQualityAuditor'

    # Full audit
    def audit(self) -> dict
    def quality_score(self) -> float

    # Individual checks
    def check_missing(self) -> dict
    def check_duplicates(self, subset: list = None) -> dict
    def check_types(self) -> dict
    def check_uniqueness(self) -> dict
    def check_patterns(self, column: str, pattern: str) -> dict

    # Validation
    def validate_column(self, column: str, rules: dict) -> dict
    def validate_dataset(self, rules: dict) -> dict

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str
    def summary(self) -> str
Quality Checks
Missing Values
missing = auditor.check_missing()
# Returns:
{
  "total_cells": 10000,
  "missing_cells": 150,
  "missing_percent": 1.5,
  "by_column": {
    "email": {"count": 50, "percent": 5.0},
    "phone": {"count": 100, "percent": 10.0}
  },
  "rows_with_missing": 120
}
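For intuition, here is a minimal pure-Python sketch of how a completeness check like the one above could be computed over a list of row dicts. This is an illustration only, not the library's actual implementation, and the sample rows are hypothetical:

```python
def check_missing(rows, columns):
    """Count missing (None or empty) cells overall and per column."""
    total_cells = len(rows) * len(columns)
    by_column = {c: 0 for c in columns}
    rows_with_missing = 0
    for row in rows:
        row_has_missing = False
        for c in columns:
            if row.get(c) in (None, ""):
                by_column[c] += 1
                row_has_missing = True
        rows_with_missing += row_has_missing
    missing_cells = sum(by_column.values())
    return {
        "total_cells": total_cells,
        "missing_cells": missing_cells,
        "missing_percent": round(100 * missing_cells / total_cells, 1),
        "by_column": {
            c: {"count": n, "percent": round(100 * n / len(rows), 1)}
            for c, n in by_column.items() if n
        },
        "rows_with_missing": rows_with_missing,
    }

rows = [
    {"email": "a@x.com", "phone": None},
    {"email": "", "phone": "555-0100"},
]
print(check_missing(rows, ["email", "phone"]))
```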
Duplicates
dups = auditor.check_duplicates()
# Returns:
{
  "total_rows": 1000,
  "duplicate_rows": 25,
  "duplicate_percent": 2.5,
  "duplicate_groups": [...],
  "by_columns": {
    "email": {"duplicates": 15},
    "phone": {"duplicates": 20}
  }
}
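Whole-row duplicate detection can be sketched in a few lines with collections.Counter; the convention below (counting every occurrence beyond the first as a duplicate) matches the report shape above, but this is an illustrative standalone sketch, not the library's code:

```python
from collections import Counter

def check_duplicates(rows):
    """Count rows whose full set of values appears more than once."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    # Each group of n identical rows contributes n - 1 duplicates.
    duplicate_rows = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "total_rows": len(rows),
        "duplicate_rows": duplicate_rows,
        "duplicate_percent": round(100 * duplicate_rows / len(rows), 1),
        "duplicate_groups": [dict(k) for k, n in counts.items() if n > 1],
    }

rows = [
    {"email": "a@x.com"},
    {"email": "a@x.com"},
    {"email": "b@x.com"},
]
print(check_duplicates(rows))
```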
Type Validation
types = auditor.check_types()
# Returns:
{
  "columns": {
    "age": {
      "detected_type": "int64",
      "unique_values": 75,
      "sample_values": [25, 30, 45],
      "issues": []
    },
    "date": {
      "detected_type": "object",
      "unique_values": 365,
      "sample_values": ["2023-01-01", "invalid"],
      "issues": ["Mixed date formats detected"]
    }
  }
}
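As an illustration of how an issue like "Mixed date formats detected" might be flagged, here is a standalone sketch that checks values against an expected format with datetime.strptime. The library's real detection heuristics may differ:

```python
from datetime import datetime

def detect_date_issues(values, fmt="%Y-%m-%d"):
    """Return a list of issue strings for values that fail to parse as fmt."""
    bad = []
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            bad.append(v)
    issues = []
    if bad:
        issues.append(f"{len(bad)} value(s) do not match {fmt}: {bad[:3]}")
    return issues

print(detect_date_issues(["2023-01-01", "invalid", "01/02/2023"]))
```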
Validation Rules
Define custom validation rules:
{
  "columns": {
    "email": {
      "required": true,
      "unique": true,
      "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    },
    "age": {
      "type": "integer",
      "min": 0,
      "max": 120
    },
    "status": {
      "allowed_values": ["active", "inactive", "pending"]
    },
    "created_at": {
      "type": "date",
      "format": "%Y-%m-%d"
    }
  }
}
results = auditor.validate_dataset(rules)
Quality Score
The quality score (0-100) is calculated from:
- Completeness (30%): Missing value ratio
- Uniqueness (25%): Duplicate row ratio
- Validity (25%): Type and constraint compliance
- Consistency (20%): Format and pattern adherence
score = auditor.quality_score()
# 85.5
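Given the weights above, the aggregation could look like the following sketch. The component values are hypothetical, and this is not necessarily the library's exact formula:

```python
# Weights from the documented scoring breakdown.
WEIGHTS = {
    "completeness": 0.30,
    "uniqueness": 0.25,
    "validity": 0.25,
    "consistency": 0.20,
}

def quality_score(components):
    """Weighted average of the four component scores (each 0-100)."""
    return round(sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 1)

# Hypothetical component scores for illustration.
components = {
    "completeness": 90.0,
    "uniqueness": 100.0,
    "validity": 80.0,
    "consistency": 70.0,
}
print(quality_score(components))  # 86.0
```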
Output Formats
Audit Report
{
  "file": "data.csv",
  "rows": 1000,
  "columns": 15,
  "quality_score": 85.5,
  "completeness": {
    "score": 92.0,
    "missing_cells": 800,
    "details": {...}
  },
  "uniqueness": {
    "score": 97.5,
    "duplicate_rows": 25,
    "details": {...}
  },
  "validity": {
    "score": 78.0,
    "type_issues": [...],
    "details": {...}
  },
  "consistency": {
    "score": 80.0,
    "pattern_issues": [...],
    "details": {...}
  },
  "recommendations": [
    "Column 'phone' has 10% missing values",
    "25 duplicate rows detected",
    "Column 'date' has inconsistent formats"
  ]
}
Example Workflows
Pre-Import Validation
import sys

auditor = DataQualityAuditor()
auditor.load_csv("import_data.csv")
report = auditor.audit()

if report['quality_score'] < 80:
    print("Data quality below threshold!")
    for rec in report['recommendations']:
        print(f"  - {rec}")
    sys.exit(1)
ETL Pipeline Check
auditor = DataQualityAuditor()
auditor.load_dataframe(transformed_df)
# Check critical columns
email_check = auditor.validate_column("email", {
    "required": True,
    "unique": True,
    "pattern": r"^[\w.+-]+@[\w-]+\.[\w.-]+$"
})
if email_check['issues']:
    raise ValueError(f"Email validation failed: {email_check['issues']}")
Generate Documentation
auditor = DataQualityAuditor()
auditor.load_csv("dataset.csv")
# Generate comprehensive report
auditor.generate_report("quality_report.html", format="html")
# Or get summary text
print(auditor.summary())
Dependencies
- pandas>=2.0.0
- numpy>=1.24.0
Source
git clone https://github.com/dkyazzentwatwa/chatgpt-skills
# Skill definition: data-quality-auditor/SKILL.md
Overview
Data Quality Auditor provides a comprehensive assessment of data quality for CSV and Excel datasets. It checks completeness, uniqueness, validity, and consistency, generates HTML or JSON reports, and computes an overall quality score, making it well suited to data validation, ETL pipeline checks, and dataset documentation.
How This Skill Works
Load data with load_csv or load_dataframe, then run audit() to compute a quality_score and produce structured reports. Use dedicated checks like check_missing, check_duplicates, check_types, and check_patterns to identify issues and verify data against custom validation rules via validate_dataset.
When to Use It
- Before loading data into ETL/ELT pipelines to catch missing values, duplicates, and type issues
- During data analytics preparation to ensure consistent formats and valid data types
- When preparing dataset documentation to include a quality score and issue breakdown
- For regular audits of data exports (CSV/Excel) to detect regressions in quality
- To enforce constraints with custom validation rules defined in a rules.json file
Quick Start
- Step 1: from data_quality_auditor import DataQualityAuditor
- Step 2: auditor = DataQualityAuditor(); auditor.load_csv('customers.csv')
- Step 3: report = auditor.audit(); print(f"Quality Score: {report['quality_score']}/100"); missing = auditor.check_missing()
Best Practices
- Run a full audit after loading data to establish a baseline quality score
- Inspect missing and duplicates per column to prioritize fixes
- Use check_types and check_patterns to enforce data type and formatting constraints
- Validate the dataset against custom rules with validate_dataset
- Generate HTML or JSON reports for stakeholders and documentation
Example Use Cases
- Customer records: identify incomplete emails and invalid date formats
- Orders dataset: detect duplicate orders and inconsistent price types
- Marketing leads: enforce unique emails and correct phone formats
- IoT sensor data: catch missing timestamps and irregular value patterns
- Product catalog: ensure valid SKUs and consistent category fields