data-anonymizer
npx machina-cli add skill dkyazzentwatwa/chatgpt-skills/data-anonymizer --openclaw
Data Anonymizer
Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.
Quick Start
from scripts.data_anonymizer import DataAnonymizer
# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at john@email.com or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"
# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")
Features
- PII Detection: Names, emails, phone numbers, SSNs, addresses, credit cards, dates of birth, IP addresses
- Multiple Strategies: Mask, redact, hash, fake data replacement
- CSV Processing: Anonymize specific columns or auto-detect
- Reversible Tokens: Optional mapping for de-anonymization
- Custom Patterns: Add your own PII patterns
- Audit Report: List all detected PII with locations
API Reference
Initialization
anonymizer = DataAnonymizer(
    strategy="mask",   # mask, redact, hash, fake
    reversible=False   # Enable token mapping
)
Text Anonymization
# Basic anonymization
result = anonymizer.anonymize(text)
# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])
# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
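The structure of the report returned with return_report=True is not shown in this document. As a rough, standalone sketch of what "detected PII with locations" could look like, the code below scans text with two simplified regular expressions and records the character span of each match. The `find_pii` helper, the patterns, and the report fields are assumptions made for illustration, not the skill's actual output format.

```python
import re

# Simplified illustrative patterns; production detectors need to be broader.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_pii(text):
    """Return one record per match: PII type, matched value, and character span."""
    findings = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    # Sort by position so the report reads in document order.
    return sorted(findings, key=lambda f: f["start"])


report = find_pii("Contact john@email.com or 555-123-4567")
for entry in report:
    print(entry)
```

Recording start and end offsets rather than only the values lets a reviewer jump to each finding in the source text.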
Masking Strategies
text = "Email john@test.com, call 555-1234"
# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
# "Email [EMAIL], call [PHONE]"
# Redact - replace with asterisks
anonymizer.strategy = "redact"
# "Email ***************, call ********"
# Hash - replace with hash
anonymizer.strategy = "hash"
# "Email a1b2c3d4, call e5f6a7b8"
# Fake - replace with realistic fake data
anonymizer.strategy = "fake"
# "Email jane@example.org, call 555-9876"
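To make the difference between the strategies concrete, here is a minimal standalone sketch of how a single detected value might be transformed under mask, redact, and hash. The `apply_strategy` function is hypothetical, and the length-preserving redaction and the 8-character SHA-256 prefix are assumptions; the skill's real output may differ. The fake strategy is left out because it depends on the Faker library.

```python
import hashlib


def apply_strategy(value, pii_type, strategy):
    """Replace one detected PII value according to the chosen strategy."""
    if strategy == "mask":
        # Type label, e.g. [EMAIL]
        return f"[{pii_type.upper()}]"
    if strategy == "redact":
        # Assumption: one asterisk per original character
        return "*" * len(value)
    if strategy == "hash":
        # Assumption: first 8 hex digits of SHA-256; the length is arbitrary
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("mask", "redact", "hash"):
    print(strategy, "->", apply_strategy("john@test.com", "email", strategy))
```

Mask and redact discard the value entirely, while hash keeps equal inputs mapped to equal outputs, which is what makes joins across datasets possible.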
CSV Processing
# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")
# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)
# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
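For readers curious about what per-column processing involves, the following sketch applies a different transform to each named column using only the standard library `csv` module. It is an independent illustration, not the skill's implementation: `redact`, `short_hash`, and `anonymize_rows` are hypothetical helpers, and in-memory buffers stand in for real files.

```python
import csv
import hashlib
import io


def redact(value):
    """Replace every character with an asterisk."""
    return "*" * len(value)


def short_hash(value):
    """Deterministic 8-character SHA-256 prefix."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def anonymize_rows(reader, column_strategies):
    """Yield rows with the listed columns transformed; other columns pass through."""
    for row in reader:
        for column, transform in column_strategies.items():
            if row.get(column):
                row[column] = transform(row[column])
        yield row


source = io.StringIO("id,email,ssn\n1,john@test.com,123-45-6789\n")
target = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(anonymize_rows(reader, {"email": short_hash, "ssn": redact}))

print(target.getvalue())
```

Streaming rows through a generator keeps memory use flat, which matters for large exports.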
Reversible Anonymization
anonymizer = DataAnonymizer(reversible=True)
# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john@test.com")
mapping = anonymizer.get_mapping()
# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")
# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
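The idea behind reversible anonymization can be sketched as a token table: each distinct value gets a stable token, and the table is later used to substitute the originals back. The code below is a simplified standalone illustration that handles emails only; `tokenize`, `detokenize`, and the `[EMAIL_n]` token format are assumptions, not the format produced by get_mapping().

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def tokenize(text, mapping):
    """Replace each email with a numbered token and record token -> original."""
    seen = {original: token for token, original in mapping.items()}

    def replace(match):
        original = match.group()
        if original not in seen:
            token = f"[EMAIL_{len(seen) + 1}]"
            seen[original] = token
            mapping[token] = original
        return seen[original]

    return EMAIL.sub(replace, text)


def detokenize(text, mapping):
    """Restore originals by substituting every token back."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


mapping = {}
masked = tokenize("From a@x.org to b@y.org, cc a@x.org", mapping)
print(masked)                       # From [EMAIL_1] to [EMAIL_2], cc [EMAIL_1]
print(detokenize(masked, mapping))  # From a@x.org to b@y.org, cc a@x.org
```

Because the mapping contains every original value, it is as sensitive as the source data itself.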
Custom Patterns
# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
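Before registering a custom pattern it can help to test the regular expression on its own. The standalone sketch below compares the pattern from the example above with a stricter variant that adds word boundaries; whether employee IDs are always exactly six digits is an assumption, and the sample text is made up.

```python
import re

LOOSE = re.compile(r"EMP-\d{6}")        # pattern as given in the example above
STRICT = re.compile(r"\bEMP-\d{6}\b")   # assumption: IDs are exactly six digits

text = "Owner EMP-004217, ticket ref EMP-1234567"

loose_result = LOOSE.sub("[EMPLOYEE_ID]", text)
strict_result = STRICT.sub("[EMPLOYEE_ID]", text)

print(loose_result)   # Owner [EMPLOYEE_ID], ticket ref [EMPLOYEE_ID]7
print(strict_result)  # Owner [EMPLOYEE_ID], ticket ref EMP-1234567
```

The loose pattern matches the first six digits of a longer number and leaves a stray digit behind, so boundaries are worth considering when identifiers have a fixed length.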
CLI Usage
# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt
# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv
# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake
# Generate audit report
python data_anonymizer.py --input document.txt --report audit.json
# Specific PII types only
python data_anonymizer.py --input doc.txt --types email phone ssn
CLI Arguments
| Argument | Description | Default |
|---|---|---|
| --input | Input file | Required |
| --output | Output file | Required |
| --strategy | Masking strategy | mask |
| --types | PII types to detect | all |
| --columns | CSV columns to process | auto |
| --report | Generate audit report | - |
| --reversible | Enable token mapping | False |
Supported PII Types
| Type | Examples | Detection method |
|---|---|---|
| name | John Smith, Mary Johnson | NLP-based |
| email | user@domain.com | Regex |
| phone | 555-123-4567, (555) 123-4567 | Regex |
| ssn | 123-45-6789 | Regex |
| credit_card | 4111-1111-1111-1111 | Regex + Luhn |
| address | 123 Main St, City, ST 12345 | NLP + Regex |
| date_of_birth | 01/15/1990, January 15, 1990 | Regex |
| ip_address | 192.168.1.1 | Regex |
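The credit_card row pairs a regex with a Luhn check, which filters out digit strings that merely look like card numbers. A minimal standalone sketch of the Luhn checksum follows; `luhn_valid` is a hypothetical helper and may differ from how the skill implements the check.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

Running the checksum after the regex match reduces false positives from order numbers or other 16-digit identifiers.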
Examples
Anonymize Customer Support Logs
anonymizer = DataAnonymizer(strategy="mask")
log = """
Ticket #1234: Customer John Doe (john.doe@company.com) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""
result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].
GDPR Compliance for Database Export
anonymizer = DataAnonymizer(strategy="hash")
# Consistent hashing for joins
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)
anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
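Joins across anonymized files only work if the same input always produces the same token. The sketch below shows one way to get that property with a keyed hash (HMAC-SHA256) from the standard library. It is an illustration under assumptions: this document does not say whether the skill uses a key or normalizes values, and `pseudonymize` and the placeholder key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed, deterministic token: same value and key always give the same output."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


key = b"replace-with-a-secret-key"  # placeholder; keep the real key out of source control

users = [{"email": "John@Test.com", "plan": "pro"}]
orders = [{"customer_email": "john@test.com ", "total": 42}]

user_tokens = {pseudonymize(u["email"], key) for u in users}
order_tokens = {pseudonymize(o["customer_email"], key) for o in orders}

print(user_tokens == order_tokens)  # True
```

Normalizing case and whitespace before hashing keeps cosmetic differences from breaking the join, and a secret key makes it harder to recover emails by hashing a list of guesses.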
Generate Test Data from Production
anonymizer = DataAnonymizer(strategy="fake")
# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)
# Test data has same structure but fake PII
Dependencies
pandas>=2.0.0
faker>=18.0.0
Limitations
- Name detection may miss unusual names
- Address detection works best for US formats
- Custom patterns may be needed for domain-specific PII
- Fake data replacement doesn't preserve exact format
Source
https://github.com/dkyazzentwatwa/chatgpt-skills/blob/main/data-anonymizer/SKILL.md
Overview
Data Anonymizer detects PII in text documents and CSV data, and masks or redacts it using multiple strategies. It supports reversible tokenization, custom patterns, and an audit report to locate sensitive data.
How This Skill Works
Data Anonymizer scans input text or CSV data, detects PII types such as names, emails, phone numbers, SSNs, and addresses, and applies the chosen strategy (mask, redact, hash, or fake). For CSV files it can auto-detect PII columns or apply per-column strategies. An optional token mapping makes the anonymization reversible, so the original values can be restored later.
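The scan, detect, and replace flow can be sketched in a few lines for the regex-detectable types. The code below combines several patterns into one alternation with named groups and masks each match with its type label in a single pass. The `mask_text` function and the simplified patterns are assumptions for illustration; names and addresses, which the skill detects with NLP, are out of scope here.

```python
import re

# Simplified illustrative patterns keyed by PII type.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\b\d{3}-\d{3}-\d{4}\b",
}


def mask_text(text, pii_types=None):
    """Mask the selected PII types with [TYPE] labels in a single pass."""
    selected = pii_types or list(PATTERNS)
    combined = re.compile(
        "|".join(f"(?P<{name}>{PATTERNS[name]})" for name in selected)
    )
    # match.lastgroup is the name of the alternative that matched.
    return combined.sub(lambda m: f"[{m.lastgroup.upper()}]", text)


print(mask_text("Mail john@email.com, SSN 123-45-6789, call 555-123-4567"))
# Mail [EMAIL], SSN [SSN], call [PHONE]
print(mask_text("Mail john@email.com, call 555-123-4567", pii_types=["phone"]))
# Mail john@email.com, call [PHONE]
```

A single combined pattern avoids one pass per type and guarantees that a span already replaced by one pattern is not rescanned by another.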
When to Use It
- Sharing customer support transcripts without exposing names or contact details
- Exporting marketing data for analytics with per-column masking
- Creating privacy-preserving ML datasets while preserving structure
- Producing audit reports that list detected PII and their locations
- Setting up reversible anonymization for secure de-identification workflows
Quick Start
- Step 1: Instantiate DataAnonymizer with a chosen strategy and reversible option
- Step 2: Anonymize text with anonymize(text), or a CSV file with anonymize_csv("input.csv", "output.csv")
- Step 3: If reversible is enabled, use get_mapping and save_mapping to persist mappings and deanonymize when needed
Best Practices
- Enable reversible mapping only if you need de-anonymization and store mappings securely
- Review the audit report to verify that all PII patterns are covered
- Define per-column strategies for CSV to minimize data loss while maintaining usefulness
- Add custom PII patterns for domain-specific identifiers
- Test on representative data and verify that de-anonymization works end to end
Example Use Cases
- Mask emails and phones in a customer support chat log
- Hash SSNs in a payroll CSV before data sharing
- Redact credit card numbers in transaction records
- Redact or normalize dates in survey responses
- Apply per-column fake data replacement for names and emails in a CRM export