
data-sourcing

npx machina-cli add skill Microck/ordinary-claude-skills/data-sourcing --openclaw

Data Sourcing & Provider Optimization Skill

When to Use

  • Selecting provider stacks for email, phone, company, or intent enrichment
  • Building or tuning waterfall sequences to improve success rates
  • Auditing credit consumption or provider performance
  • Designing enrichment logic for GTM ops, RevOps, or data engineering teams

Framework

You are an expert at selecting and optimizing data providers from 150+ available options to maximize data quality while minimizing credit costs. Use this layered framework to keep enrichment predictable and efficient.

Core Principles

  1. Quality-Cost Balance: Optimize for highest data quality within budget constraints
  2. Smart Routing: Route requests to providers based on input type and success probability
  3. Waterfall Logic: Use sequential provider attempts for maximum success
  4. Caching Strategy: Leverage cached data to reduce redundant API calls
  5. Bulk Optimization: Process similar requests together for volume discounts

Provider Selection Matrix

For Email Discovery

Best Input Scenarios:

  • Have LinkedIn URL: ContactOut → RocketReach → Apollo
  • Have Name + Company: Apollo → Hunter → RocketReach → FindyMail
  • Have Domain Only: Hunter → Apollo → Clearbit
  • Have Email (need validation): ZeroBounce → NeverBounce → Debounce
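A minimal sketch of how the input scenarios above could be encoded as a routing table. The function name `pick_email_sequence` and the record keys (`email`, `linkedin_url`, `name`, `company`, `domain`) are assumptions made for illustration; the provider names are plain labels, not real client calls.

```python
# Hypothetical routing table built from the scenarios listed above.
EMAIL_SEQUENCES = {
    "linkedin_url": ["contactout", "rocketreach", "apollo"],
    "name_company": ["apollo", "hunter", "rocketreach", "findymail"],
    "domain_only": ["hunter", "apollo", "clearbit"],
    "validate_email": ["zerobounce", "neverbounce", "debounce"],
}


def pick_email_sequence(record: dict) -> list:
    """Return the provider order for the most specific input the record carries."""
    if record.get("email"):
        # An email is already known, so it only needs validation.
        return EMAIL_SEQUENCES["validate_email"]
    if record.get("linkedin_url"):
        return EMAIL_SEQUENCES["linkedin_url"]
    if record.get("name") and record.get("company"):
        return EMAIL_SEQUENCES["name_company"]
    if record.get("domain"):
        return EMAIL_SEQUENCES["domain_only"]
    return []  # nothing usable to route on


print(pick_email_sequence({"name": "Jane Doe", "company": "Acme"}))
```

Checking the most specific input first keeps the routing deterministic when a record carries several fields at once.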

Quality Tiers:

  • Premium (90%+ success): ZoomInfo, BetterContact waterfall
  • Standard (75%+ success): Apollo, Hunter, RocketReach
  • Budget (60%+ success): Snov.io, Prospeo, ContactOut

For Company Intelligence

Data Type Priority:

  • Basic Firmographics: Clearbit (fastest) → Ocean.io → Apollo
  • Financial Data: Crunchbase → PitchBook → Dealroom
  • Technology Stack: BuiltWith → HG Insights → Clearbit
  • Intent Signals: B2D AI → ZoomInfo Intent → 6sense
  • News & Social: Google News → Social platforms → Owler

Industry Specialization:

  • Startups: Crunchbase, Dealroom, AngelList
  • Enterprise: ZoomInfo, D&B, HG Insights
  • E-commerce: Store Leads, BuiltWith, Shopify data
  • Healthcare: Definitive Healthcare + compliance providers
  • Financial Services: PitchBook, S&P Capital IQ

Credit Optimization Strategies

Cost Tiers

Tier 0 (Free): Native operations, cached data, manual inputs
Tier 1 (0.5 credits): Validation, verification, basic lookups
Tier 2 (1-2 credits): Standard enrichments (Apollo, Hunter, Clearbit)
Tier 3 (2-3 credits): Premium data (ZoomInfo, technographics, intent)
Tier 4 (3-5 credits): Enterprise intelligence (PitchBook, custom AI)
Tier 5 (5-10 credits): Specialized services (video generation, deep AI research)
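The tiers above can be turned into a rough budget estimate. The sketch below assumes you know how many calls you plan to make per tier; `COST_TIERS` copies the credit ranges from the list, and `estimate_credits` is a hypothetical helper name.

```python
# (min, max) credits per call for each tier, copied from the list above.
COST_TIERS = {
    0: (0.0, 0.0),
    1: (0.5, 0.5),
    2: (1.0, 2.0),
    3: (2.0, 3.0),
    4: (3.0, 5.0),
    5: (5.0, 10.0),
}


def estimate_credits(calls_per_tier: dict) -> tuple:
    """Return the (low, high) credit estimate for a planned mix of calls."""
    low = sum(COST_TIERS[tier][0] * n for tier, n in calls_per_tier.items())
    high = sum(COST_TIERS[tier][1] * n for tier, n in calls_per_tier.items())
    return low, high


# 1000 validations, 400 standard enrichments, 50 premium lookups.
print(estimate_credits({1: 1000, 2: 400, 3: 50}))  # (1000.0, 1450.0)
```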

Optimization Tactics

1. Cache Everything

  • Email: 30-day cache
  • Company: 90-day cache
  • Intent: 7-day cache
  • Static data: Indefinite cache
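One way to apply these TTLs is a freshness check before any paid call. This is a sketch only: the `CACHE_TTL_DAYS` keys and the `is_fresh` helper are illustrative names, and how cache timestamps are stored is left to your own cache layer.

```python
from datetime import datetime, timedelta

# TTLs from the list above; None means the entry never expires.
CACHE_TTL_DAYS = {"email": 30, "company": 90, "intent": 7, "static": None}


def is_fresh(data_type: str, cached_at: datetime, now: datetime) -> bool:
    """Return True if a cached entry of this type is still within its TTL."""
    ttl = CACHE_TTL_DAYS[data_type]
    if ttl is None:
        return True
    return now - cached_at <= timedelta(days=ttl)


now = datetime(2024, 1, 31)
print(is_fresh("email", datetime(2024, 1, 1), now))    # True: exactly 30 days old
print(is_fresh("intent", datetime(2024, 1, 20), now))  # False: 11 days old
```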

2. Batch Processing

# Pick a processing mode by batch size; larger batches unlock volume discounts.
def choose_processing_mode(record_count):
    if record_count > 1000:
        return "apollo_bulk"  # bulk endpoint, 10-30% discount
    elif record_count > 100:
        return "parallel"
    else:
        return "standard"

3. Smart Waterfalls

waterfall_sequence = [
    {"provider": "cache", "credits": 0},
    {"provider": "apollo", "credits": 1.5, "stop_if_success": True},
    {"provider": "hunter", "credits": 1.2, "stop_if_success": True},
    {"provider": "bettercontact", "credits": 3, "stop_if_success": True},
    {"provider": "ai_research", "credits": 5, "last_resort": True}
]
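A sequence like this still needs a small runner. The sketch below is one possible shape, not the skill's actual executor: `run_waterfall` and `fake_lookup` are hypothetical names, the runner stops at the first non-empty result rather than inspecting the `stop_if_success` flag, and it assumes every attempt is billed whether or not it succeeds (many providers only charge on a hit).

```python
from typing import Callable, Optional

# Same steps as the sequence above, repeated so the sketch runs on its own.
WATERFALL = [
    {"provider": "cache", "credits": 0},
    {"provider": "apollo", "credits": 1.5, "stop_if_success": True},
    {"provider": "hunter", "credits": 1.2, "stop_if_success": True},
    {"provider": "bettercontact", "credits": 3, "stop_if_success": True},
    {"provider": "ai_research", "credits": 5, "last_resort": True},
]


def run_waterfall(sequence, lookup: Callable[[str], Optional[dict]]):
    """Try each provider in order and return (result, credits_spent)."""
    spent = 0.0
    for step in sequence:
        spent += step["credits"]  # assumption: every attempt is billed
        result = lookup(step["provider"])
        if result is not None:
            return result, spent  # stop at the first hit
    return None, spent


# Fake lookup in which only Hunter finds an email.
def fake_lookup(provider: str) -> Optional[dict]:
    return {"email": "jane@example.com"} if provider == "hunter" else None


result, spent = run_waterfall(WATERFALL, fake_lookup)
print(result, round(spent, 2))
```

Tracking credits inside the runner makes it easy to log cost per record alongside the provider that produced the hit.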

Provider-Specific Optimizations

Apollo.io

  • Strengths: US B2B, LinkedIn data, phone numbers
  • Weaknesses: International coverage, personal emails
  • Tips: Use bulk API for 10%+ discount, batch similar companies

ZoomInfo

  • Strengths: Enterprise data, org charts, intent signals
  • Weaknesses: Expensive, SMB coverage
  • Tips: Reserve for high-value accounts, negotiate enterprise deals

Hunter

  • Strengths: Domain searches, email patterns, API reliability
  • Weaknesses: Phone numbers, detailed contact info
  • Tips: Best for initial domain exploration, use pattern detection

Clearbit

  • Strengths: Real-time API, company data, speed
  • Weaknesses: Email discovery rates, phone numbers
  • Tips: Great for instant enrichment, combine with others for contacts

BuiltWith

  • Strengths: Technology detection, historical data, e-commerce
  • Weaknesses: Contact information, company financials
  • Tips: Filter accounts by technology before enrichment

Waterfall Strategies

Maximum Success Waterfall

Priority: Success rate over cost
Sequence:
  1. BetterContact (aggregates 10+ sources)
  2. ZoomInfo (if enterprise)
  3. Apollo + Hunter + RocketReach
  4. AI web research
Expected Success: 95%+
Average Cost: 8-12 credits

Balanced Waterfall

Priority: Good success with reasonable cost
Sequence:
  1. Apollo.io
  2. Hunter (if domain match)
  3. RocketReach (if name match)
  4. Stop or continue based on confidence
Expected Success: 80%
Average Cost: 3-5 credits

Budget Waterfall

Priority: Minimize cost
Sequence:
  1. Cache check
  2. Hunter (domain only)
  3. Free sources (Google, LinkedIn public)
  4. Stop at first result
Expected Success: 60%
Average Cost: 1-2 credits
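To compare the three waterfalls on equal footing, it can help to look at credits per successful record rather than credits per attempt. The sketch below uses only the success rates and cost ranges stated above, takes the midpoint of each cost range, and assumes "Average Cost" means cost per attempted record; `credits_per_success` is a hypothetical helper.

```python
# Success rates and cost ranges as stated for each waterfall above.
WATERFALLS = {
    "maximum": {"success": 0.95, "cost": (8, 12)},
    "balanced": {"success": 0.80, "cost": (3, 5)},
    "budget": {"success": 0.60, "cost": (1, 2)},
}


def credits_per_success(name: str) -> float:
    """Midpoint cost per attempt divided by the expected success rate."""
    waterfall = WATERFALLS[name]
    midpoint = sum(waterfall["cost"]) / 2
    return midpoint / waterfall["success"]


for name in WATERFALLS:
    print(name, round(credits_per_success(name), 2))
```

Under these assumptions the figures come out to roughly 10.5, 5.0, and 2.5 credits per successful record, so the Maximum Success waterfall costs about four times as much per hit as the Budget one.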

Quality Scoring Framework

def calculate_data_quality_score(data, sources):
    score = 0
    
    # Multi-source validation (30 points)
    if len(sources) > 1:
        score += min(len(sources) * 10, 30)
    
    # Data completeness (30 points, split evenly across the required fields)
    required_fields = ["email", "phone", "title", "company"]
    filled = sum(1 for field in required_fields if data.get(field))
    score += 30 * filled // len(required_fields)
    
    # Verification status (20 points)
    if data.get("email_verified"):
        score += 10
    if data.get("phone_verified"):
        score += 10
    
    # Recency (20 points)
    days_old = get_data_age(data)
    if days_old < 30:
        score += 20
    elif days_old < 90:
        score += 10
    
    return score
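The scoring function calls `get_data_age`, which is not defined in this skill. A minimal sketch of such a helper follows, assuming each record carries an ISO-formatted `updated_at` date; both the field name and the fallback value for missing timestamps are assumptions.

```python
from datetime import date
from typing import Optional


def get_data_age(data: dict, today: Optional[date] = None) -> int:
    """Return the record's age in days; records without a timestamp count as stale."""
    today = today or date.today()
    updated = data.get("updated_at")  # hypothetical field, e.g. "2024-01-01"
    if not updated:
        return 10_000  # no timestamp: treat as very old so it earns no recency points
    return (today - date.fromisoformat(updated)).days


print(get_data_age({"updated_at": "2024-01-01"}, today=date(2024, 1, 31)))  # 30
```

Passing `today` explicitly keeps the helper easy to test; in production the default of `date.today()` applies.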

Industry-Specific Provider Selection

SaaS/Technology

  • Primary: Apollo, Clearbit, BuiltWith
  • Secondary: ZoomInfo, HG Insights
  • Intent: G2, TrustRadius, 6sense

Financial Services

  • Primary: PitchBook, ZoomInfo
  • Compliance: LexisNexis, D&B
  • News: Bloomberg, Reuters

Healthcare

  • Primary: Definitive Healthcare
  • Compliance: NPPES, state boards
  • Standard: ZoomInfo with healthcare filters

E-commerce

  • Primary: Store Leads, BuiltWith
  • Platform-specific: Shopify, Amazon seller data
  • Standard: Clearbit with e-commerce signals

Troubleshooting Common Issues

Low Email Discovery Rate

  • Check email patterns with Hunter
  • Try personal email providers
  • Use AI research for executives
  • Consider LinkedIn outreach instead

High Credit Usage

  • Audit waterfall sequences
  • Increase cache TTL
  • Negotiate volume deals
  • Use native operations first

Poor Data Quality

  • Add verification steps
  • Cross-reference multiple sources
  • Set minimum confidence thresholds
  • Implement human review for critical data
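The last two remedies, minimum confidence thresholds and human review, can be combined into a simple triage step on a 0-100 quality score. The thresholds of 70 and 40 below are placeholders chosen for illustration, not values from this skill, and `triage` is a hypothetical helper.

```python
def triage(score: int, accept_at: int = 70, review_at: int = 40) -> str:
    """Map a 0-100 quality score to an action; thresholds are illustrative."""
    if score >= accept_at:
        return "sync"
    if score >= review_at:
        return "human_review"
    return "reject"


for score in (85, 55, 20):
    print(score, triage(score))
```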

Advanced Techniques

Hybrid Enrichment

# Combine AI and traditional providers
def hybrid_enrichment(company):
    # Fast, cheap base data
    base = clearbit_lookup(company)
    
    # AI for missing pieces
    if not base.get("description"):
        base["description"] = ai_generate_description(company)
    
    # Premium for high-value
    if is_enterprise_account(base):
        base.update(zoominfo_enrich(company))
    
    return base

Progressive Enrichment

# Enrich in stages based on engagement
def progressive_enrichment(lead):
    # Stage 1: Basic (on import)
    if lead.stage == "new":
        return basic_enrichment(lead)  # 1-2 credits
    
    # Stage 2: Engaged (opened email)
    elif lead.stage == "engaged":
        return standard_enrichment(lead)  # 3-5 credits
    
    # Stage 3: Qualified (booked meeting)
    elif lead.stage == "qualified":
        return comprehensive_enrichment(lead)  # 10+ credits

Templates

  • Provider Cheat Sheet: See references/provider_cheat_sheet.md for provider selection.
  • Cost Calculator: See scripts/cost_calculator.py for estimating credit usage.
  • Integration Code Templates:
// JavaScript/Node.js template
const enrichContact = async (name, company) => {
  // Check cache first
  const cached = await checkCache(name, company);
  if (cached) return cached;
  
  // Try providers in sequence
  const providers = ['apollo', 'hunter', 'rocketreach'];
  
  for (const provider of providers) {
    try {
      const result = await callProvider(provider, {name, company});
      if (result.email) {
        await saveToCache(result);
        return result;
      }
    } catch (error) {
      console.log(`${provider} failed, trying next...`);
    }
  }
  
  // Fallback to AI research
  return await aiResearch(name, company);
};

Tips

  • Pre-build waterfalls per motion so GTM teams can call a single orchestration command rather than juggling providers.
  • Instrument cache hit rates; alert RevOps when the hit rate drops below target to avoid a spike in credit usage.
  • Rotate premium providers each quarter to negotiate better volume discounts and diversify coverage gaps.
  • Pair enrichment with QA hooks (e.g., verification APIs, sampling) before syncing into CRM to prevent bad data cascades.
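The cache instrumentation tip can be sketched as a small counter. The class name `CacheStats`, the default target of 0.6, and the `needs_alert` method are all assumptions; wiring the alert to RevOps is left to your own tooling.

```python
class CacheStats:
    """Counts cache hits and misses and flags a hit rate below target."""

    def __init__(self, target_hit_rate: float = 0.6):
        self.target_hit_rate = target_hit_rate  # hypothetical target
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def needs_alert(self) -> bool:
        # Only alert once there is traffic to measure.
        return (self.hits + self.misses) > 0 and self.hit_rate < self.target_hit_rate


stats = CacheStats(target_hit_rate=0.6)
for hit in (True, False, False, False):
    stats.record(hit)
print(stats.hit_rate, stats.needs_alert())  # 0.25 True
```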

Progressive disclosure: Load full provider details and code examples only when actively optimizing enrichment workflows.

Source

https://github.com/Microck/ordinary-claude-skills/blob/main/skills_all/data-sourcing/SKILL.md

Overview

Optimize provider selection, routing, and credit usage across 150+ enrichment sources for company and contact intelligence. This skill helps build efficient enrichment pipelines, tune waterfalls, and audit credit consumption for GTM, RevOps, and data engineering teams.

How This Skill Works

It applies a layered framework to maximize data quality while controlling costs: Quality-Cost Balance, Smart Routing, Waterfall Logic, Caching Strategy, and Bulk Optimization. It also defines provider selection matrices for Email Discovery and Company Intelligence and uses sequential attempts, caching, and batch processing to reduce credits.

When to Use It

  • Selecting provider stacks for email, phone, company, or intent enrichment
  • Building or tuning waterfall sequences to improve success rates
  • Auditing credit consumption or provider performance
  • Designing enrichment logic for GTM ops, RevOps, or data engineering teams
  • Scaling enrichment across 150+ sources while controlling costs

Quick Start

  1. Map inputs to enrichment needs (email, company, domain, intent) and note available providers.
  2. Create a Smart Waterfall sequence and assign per-provider credit budgets.
  3. Enable caching and batch processing; monitor results and iterate.

Best Practices

  • Prioritize data quality while respecting budget constraints (Quality-Cost Balance).
  • Route requests based on input type and success probability (Smart Routing).
  • Use Waterfall Logic with stop_if_success to minimize unnecessary credits.
  • Cache results (e.g., Email 30 days, Company 90 days, Intent 7 days) to cut calls.
  • Batch similar records to unlock volume discounts (Bulk Optimization).

Example Use Cases

  • Email discovery sequencing: Have LinkedIn URL → ContactOut → RocketReach → Apollo.
  • Have Name + Company: Apollo → Hunter → RocketReach → FindyMail.
  • Company intelligence: Basic Firmographics prioritized as Clearbit → Ocean.io → Apollo.
  • Caching strategy: Cache Email for 30 days, Company for 90 days, Intent for 7 days.
  • Batch processing: >1000 records triggers apollo_bulk discounts (10-30% savings).
