
Install:

npx machina-cli add skill blisspixel/primr/scrape-strategy --openclaw

Scrape Strategy

Primr uses an 8-tier fallback system for web scraping. See references/tiers.md for the full tier table and selection heuristics.

Key Features

  • Sticky Tier: Once a tier works for a host, it's tried first for subsequent pages
  • Circuit Breaker: After 3 consecutive failures of the same tier for a host, that tier is skipped
  • Cookie Handoff: Cookies obtained by browser tiers are reused by faster HTTP tiers
  • Content Validation: Checks actual content, not just HTTP status; catches "200 OK" responses that are actually block pages
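The sticky-tier and circuit-breaker behavior can be sketched as per-host state. This is an illustrative model only; the class name, tier numbering, and thresholds here are assumptions, not Primr's actual internals:

```python
class HostTierState:
    """Per-host scraping state: a sticky tier plus a circuit breaker per tier."""

    MAX_FAILURES = 3  # consecutive failures before a tier is skipped

    def __init__(self, num_tiers: int = 8):
        self.num_tiers = num_tiers
        self.sticky_tier = 1   # last tier that worked for this host
        self.failures = {}     # tier -> consecutive failure count

    def tier_order(self):
        """Try the sticky tier first, then the rest, skipping tripped tiers."""
        order = [self.sticky_tier] + [
            t for t in range(1, self.num_tiers + 1) if t != self.sticky_tier
        ]
        return [t for t in order if self.failures.get(t, 0) < self.MAX_FAILURES]

    def record_success(self, tier: int):
        self.sticky_tier = tier  # stick with what worked
        self.failures[tier] = 0  # reset the breaker

    def record_failure(self, tier: int):
        self.failures[tier] = self.failures.get(tier, 0) + 1
```

A success makes that tier the new starting point for the host; three consecutive failures take a tier out of rotation.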

Error Handling

Content Validation Indicators

  • Content length < 1000 bytes
  • Contains "access denied", "blocked", "captcha"
  • Missing expected content markers
  • Redirect to login/challenge page
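These indicators translate directly into a validation check. A minimal sketch, assuming the thresholds and marker strings listed above (the function name is illustrative):

```python
BLOCK_MARKERS = ("access denied", "blocked", "captcha")

def looks_blocked(body: str, final_url: str, min_bytes: int = 1000) -> bool:
    """Heuristic content validation: flag 200 OK responses that are block pages."""
    text = body.lower()
    if len(body.encode("utf-8")) < min_bytes:
        return True  # suspiciously short page
    if any(marker in text for marker in BLOCK_MARKERS):
        return True  # explicit block language in the body
    if any(word in final_url.lower() for word in ("login", "challenge")):
        return True  # redirected to a login/challenge page
    return False
```

A "missing expected content markers" check would additionally compare the body against site-specific markers, which are omitted here.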

Tier Escalation

On failure: log reason, check circuit breaker, try next tier. Stops after 3 consecutive same-error failures.
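The escalation loop above can be sketched as follows, assuming each tier is a callable that returns page text or raises on failure (a simplification of whatever interface Primr actually uses):

```python
def scrape_with_escalation(url, tiers, max_same_error=3):
    """Try tiers in order; stop after repeated identical errors."""
    last_error, same_error_count = None, 0
    for fetch in tiers:
        try:
            return fetch(url)
        except Exception as exc:
            reason = type(exc).__name__  # log this in a real implementation
            if reason == last_error:
                same_error_count += 1
            else:
                last_error, same_error_count = reason, 1
            if same_error_count >= max_same_error:
                break  # same error three times in a row: give up
    return None
```

The circuit-breaker check (skipping tiers already known to fail for this host) would filter `tiers` before this loop runs.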

Recovery Strategies

Failure Type      Strategy
Timeout           Increase timeout, try slower tier
403 Forbidden     Try stealth tier (4-5)
429 Rate Limit    Exponential backoff, reduce concurrency
SSL Error         Try TLS compatibility tier (3)
Empty Content     Try aggressive tier (2)
CAPTCHA           Skip page, note in results
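The recovery table maps naturally onto a lookup. The keys and action names below are hypothetical; this only mirrors the table, not Primr's real dispatch logic:

```python
# Failure type -> recovery action, mirroring the recovery table.
RECOVERY = {
    "timeout":       {"action": "increase_timeout", "next_tier": "slower"},
    "http_403":      {"action": "escalate", "next_tier": "stealth (4-5)"},
    "http_429":      {"action": "backoff", "reduce_concurrency": True},
    "ssl_error":     {"action": "escalate", "next_tier": "tls_compat (3)"},
    "empty_content": {"action": "escalate", "next_tier": "aggressive (2)"},
    "captcha":       {"action": "skip_page", "note_in_results": True},
}

def recovery_for(failure: str) -> dict:
    """Pick a recovery strategy; unknown failures just escalate to the next tier."""
    return RECOVERY.get(failure, {"action": "next_tier"})
```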

Interpreting Results

+ 34/46 pages scraped
34 = successfully scraped, 46 = total selected, 12 = failed
  • 70%+ success rate: Good coverage
  • 50-70%: Acceptable for protected sites
  • <50%: Consider deep mode instead
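The thresholds above can be expressed as a small classifier (the function name and verdict strings are illustrative):

```python
def coverage_verdict(scraped: int, selected: int) -> str:
    """Classify a scrape run by success rate, per the thresholds above."""
    rate = scraped / selected
    if rate >= 0.70:
        return "good coverage"
    if rate >= 0.50:
        return "acceptable for protected sites"
    return "consider deep mode"
```

For the example run, 34/46 is roughly 74%, which lands in the "good coverage" band.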

Example Workflow

User: "The site seems heavily protected"

1. Check prior scrape results:
   - Success rate: 35%
   - Most pages blocked at tier 4

2. Recommend strategy:
   "This site has strong protection. I recommend:
    - Use deep mode for external research
    - Or accept partial scrape results

    Deep mode gathers information from external sources
    without needing to access the protected site directly."

3. If user chooses deep mode:
   estimate_run(company, url, "deep")
   -> Cost: $0.80, Time: ~12 minutes

Constraints

  • Patient Timeout: 90s max per page (allows multiple tier attempts)
  • Concurrency: 3 concurrent pages default
  • Circuit Breaker: 3 failures before tier skip
  • Smart Escalation: Stops after 3 consecutive same-error failures
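Collected as configuration, the constraints look like this (field names are illustrative; the values come from the list above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScrapeConstraints:
    """Defaults from the Constraints list above."""
    page_timeout_s: int = 90   # patient timeout, allows multiple tier attempts
    concurrency: int = 3       # concurrent pages by default
    breaker_failures: int = 3  # consecutive failures before a tier is skipped
    max_same_error: int = 3    # escalation stops after this many identical errors
```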

Source

git clone https://github.com/blisspixel/primr

The skill file lives at skills/scrape-strategy/SKILL.md in the repository.

Overview

Primr uses an 8-tier fallback system to navigate web scraping challenges, with features like sticky tiers, circuit breakers, and content validation. It helps when scraping fails or when sites implement protections, and it clarifies how tiers behave during escalation.

How This Skill Works

When scraping a host, Primr starts with a working tier (sticky tier) and, on failures, escalates to the next tier. A circuit breaker skips a failing tier after 3 consecutive errors, and cookies from browser tiers are reused by faster HTTP tiers (cookie handoff). Content validation checks the actual content rather than just HTTP status to detect blocks like CAPTCHA or redirects.

When to Use It

  • Scraping fails due to host errors or timeouts
  • Site protection blocks access (CAPTCHA, 403, 429, redirects)
  • Questions about how tiers are selected and escalated
  • Need to tune timeout, concurrency, or tier usage for a site
  • Troubleshoot a specific host using tier results and recovery strategies

Quick Start

  1. Enable scrape-strategy and begin with the sticky tier for the host
  2. If a page fails, log the reason and let the circuit breaker guide tier escalation
  3. Apply recovery actions (timeout increases, backoff, tier changes) based on the error type

Best Practices

  • Leverage the Sticky Tier: once a tier works for a host, reuse it for subsequent pages
  • Apply the Circuit Breaker: skip a tier after 3 consecutive failures for the same host
  • Use Cookie Handoff: reuse cookies from browser tiers in faster HTTP tiers
  • Implement Content Validation: verify content length and markers, not just 200 OK
  • Follow Recovery Strategies: adjust timeout, backoff on 429, and escalate tiers as needed
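Exponential backoff for 429 responses is one standard way to regain throughput. A sketch with jitter; the retry counts, base, and cap here are illustrative defaults, not Primr's exact schedule:

```python
import random

def backoff_delays(retries: int = 4, base: float = 1.0, cap: float = 30.0):
    """Exponential backoff schedule with jitter for 429 rate-limit responses."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))      # 1s, 2s, 4s, 8s, ... capped
        delays.append(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd
    return delays
```

Reducing concurrency at the same time (for example, from 3 pages to 1) lowers the chance of tripping the rate limit again.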

Example Use Cases

  • A site shows strong protection with a 35% success rate; tier 4 blocks most pages, so deep mode or partial scraping is recommended
  • Encountering 403 Forbidden; switch to stealth tier (4-5) to bypass basic protections
  • SSL errors resolved by switching to TLS compatibility tier (3)
  • 429 rate limits trigger exponential backoff and reduced concurrency to regain throughput
  • Content validation flags a redirect to login, leading to page skip and a note in results
