scrape-strategy

npx machina-cli add skill blisspixel/primr/scrape-strategy --openclaw

Scrape Strategy
Primr uses an 8-tier fallback system for web scraping. See references/tiers.md for the full tier table and selection heuristics.
Key Features
- Sticky Tier: Once a tier works for a host, it's tried first for subsequent pages
- Circuit Breaker: After 3 consecutive failures of the same tier for a host, that tier is skipped
- Cookie Handoff: Cookies obtained by browser tiers are reused by faster HTTP tiers
- Content Validation: Checks actual content, not just HTTP status -- catches "200 OK" responses that are actually block pages
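The sticky-tier and circuit-breaker rules above can be sketched as per-host state. This is a minimal illustration in Python; `HostState` and its method names are hypothetical, not Primr's actual API:

```python
# Hypothetical per-host tier state: a sticky tier that is tried first,
# plus a circuit breaker that skips a tier after 3 consecutive failures.
from collections import defaultdict

FAILURE_LIMIT = 3  # circuit-breaker threshold from the feature list above

class HostState:
    def __init__(self, num_tiers=8):
        self.num_tiers = num_tiers
        self.sticky_tier = 0                  # last tier that worked for this host
        self.failures = defaultdict(int)      # tier -> consecutive failure count

    def record_result(self, tier, ok):
        if ok:
            self.sticky_tier = tier           # reuse this tier for later pages
            self.failures[tier] = 0           # success resets the breaker
        else:
            self.failures[tier] += 1

    def tier_order(self):
        """Sticky tier first, then the rest, skipping tripped circuits."""
        order = [self.sticky_tier] + [t for t in range(self.num_tiers)
                                      if t != self.sticky_tier]
        return [t for t in order if self.failures[t] < FAILURE_LIMIT]
```

A success on tier 2 makes tier 2 first in line for the host; three straight failures on it remove it from the rotation until a success elsewhere.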
Error Handling
Content Validation Indicators
- Content length < 1000 bytes
- Contains "access denied", "blocked", "captcha"
- Missing expected content markers
- Redirect to login/challenge page
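The four indicators above can be combined into a single check. A sketch, with the function name and marker handling as illustrative assumptions (the 1000-byte minimum and block phrases mirror the list):

```python
# Sketch of content validation: flag "200 OK" responses that are really
# block pages, using the indicators listed above.
BLOCK_PHRASES = ("access denied", "blocked", "captcha")

def looks_blocked(body: str, final_url: str, expected_markers=()) -> bool:
    text = body.lower()
    if len(body.encode("utf-8")) < 1000:          # suspiciously short page
        return True
    if any(phrase in text for phrase in BLOCK_PHRASES):
        return True
    if expected_markers and not any(m.lower() in text for m in expected_markers):
        return True                               # missing expected content
    if any(p in final_url for p in ("/login", "/challenge")):
        return True                               # redirected to a gate page
    return False
```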
Tier Escalation
On failure: log reason, check circuit breaker, try next tier. Stops after 3 consecutive same-error failures.
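The escalation rule can be expressed as a short loop. `fetch_with_tier` is a hypothetical callable standing in for a tier's fetch logic; the stop-after-3-identical-errors behavior matches the rule above:

```python
# Minimal escalation loop: log the failure reason, move to the next tier,
# and stop after 3 consecutive failures with the same error.
def scrape_with_escalation(tiers, fetch_with_tier, max_same_error=3):
    last_error, same_error_count = None, 0
    for tier in tiers:                 # tiers already filtered by the breaker
        try:
            return fetch_with_tier(tier)
        except Exception as exc:
            reason = type(exc).__name__
            print(f"tier {tier} failed: {reason}")     # log the reason
            same_error_count = same_error_count + 1 if reason == last_error else 1
            last_error = reason
            if same_error_count >= max_same_error:
                break                  # same error 3x: escalating won't help
    return None
```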
Recovery Strategies
| Failure Type | Strategy |
|---|---|
| Timeout | Increase timeout, try slower tier |
| 403 Forbidden | Try stealth tier (4-5) |
| 429 Rate Limit | Exponential backoff, reduce concurrency |
| SSL Error | Try TLS compatibility tier (3) |
| Empty Content | Try aggressive tier (2) |
| CAPTCHA | Skip page, note in results |
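The 429 row above is the one that benefits from a concrete formula. A common sketch is exponential backoff with jitter; the base, cap, and jitter range here are illustrative defaults, not Primr's:

```python
# Illustrative backoff for 429 rate limits: exponential delay, capped,
# with 50-100% jitter so concurrent workers don't retry in lockstep.
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry N: min(cap, base * 2^N) scaled by jitter."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```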
Interpreting Results
34/46 pages scraped
34 = successfully scraped, 46 = total selected, 12 = failed
- 70%+ success rate: Good coverage
- 50-70%: Acceptable for protected sites
- <50%: Consider deep mode instead
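The thresholds above map directly to a verdict function. A minimal sketch (the name `coverage_verdict` and return strings are illustrative):

```python
# Apply the success-rate thresholds above to a scrape summary.
def coverage_verdict(scraped: int, total: int) -> str:
    rate = scraped / total if total else 0.0
    if rate >= 0.70:
        return "good coverage"
    if rate >= 0.50:
        return "acceptable for protected sites"
    return "consider deep mode instead"
```

For the sample result above, 34/46 is about 74%, which lands in the "good coverage" band.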
Example Workflow
User: "The site seems heavily protected"
1. Check prior scrape results:
- Success rate: 35%
- Most pages blocked at tier 4
2. Recommend strategy:
"This site has strong protection. I recommend:
- Use deep mode for external research
- Or accept partial scrape results
Deep mode gathers information from external sources
without needing to access the protected site directly."
3. If user chooses deep mode:
estimate_run(company, url, "deep")
-> Cost: $0.80, Time: ~12 minutes
Constraints
- Patient Timeout: 90s max per page (allows multiple tier attempts)
- Concurrency: 3 concurrent pages default
- Circuit Breaker: 3 failures before tier skip
- Smart Escalation: Stops after 3 consecutive same-error failures
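The timeout and concurrency constraints above can be sketched with `asyncio`. `fetch_page` is a hypothetical coroutine; the 90-second budget and concurrency of 3 come from the constraint list:

```python
# Sketch of the constraints above: 3 concurrent pages, 90s budget per page.
import asyncio

PAGE_TIMEOUT = 90   # seconds per page, allows multiple tier attempts
CONCURRENCY = 3     # concurrent pages (default)

async def scrape_all(urls, fetch_page):
    sem = asyncio.Semaphore(CONCURRENCY)

    async def one(url):
        async with sem:
            try:
                return await asyncio.wait_for(fetch_page(url), PAGE_TIMEOUT)
            except asyncio.TimeoutError:
                return None            # counted as a failed page

    return await asyncio.gather(*(one(u) for u in urls))
```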
Source
https://github.com/blisspixel/primr/blob/main/skills/scrape-strategy/SKILL.md

Overview
Primr uses an 8-tier fallback system to navigate web scraping challenges, with features like sticky tiers, circuit breakers, and content validation. It helps when scraping fails or when sites implement protections, and it clarifies how tiers behave during escalation.
How This Skill Works
When scraping a host, Primr starts with a working tier (sticky tier) and, on failures, escalates to the next tier. A circuit breaker skips a failing tier after 3 consecutive errors, and cookies from browser tiers are reused by faster HTTP tiers (cookie handoff). Content validation checks the actual content rather than just HTTP status to detect blocks like CAPTCHA or redirects.
When to Use It
- Scraping fails due to host errors or timeouts
- Site protection blocks access (CAPTCHA, 403, 429, redirects)
- Questions about how tiers are selected and escalated
- Need to tune timeout, concurrency, or tier usage for a site
- Troubleshoot a specific host using tier results and recovery strategies
Quick Start
- Step 1: Enable scrape-strategy and begin with the sticky tier for the host
- Step 2: If a page fails, log the reason and let the circuit breaker guide tier escalation
- Step 3: Apply recovery actions (timeout increase, backoff, and tier changes) based on error type
Best Practices
- Leverage the Sticky Tier: once a tier works for a host, reuse it for subsequent pages
- Apply the Circuit Breaker: skip a tier after 3 consecutive failures for the same host
- Use Cookie Handoff: reuse cookies from browser tiers in faster HTTP tiers
- Implement Content Validation: verify content length and markers, not just 200 OK
- Follow Recovery Strategies: adjust timeout, backoff on 429, and escalate tiers as needed
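The cookie-handoff practice above amounts to replaying browser-captured cookies in a plain HTTP tier. A minimal sketch, assuming the browser tier exports cookies as a list of name/value dicts (the shape is an assumption, not Primr's format):

```python
# Build a Cookie request header for an HTTP tier from cookies captured
# by a browser tier, so the faster tier inherits the session.
def cookie_header(browser_cookies):
    """browser_cookies: list of {'name': ..., 'value': ...} dicts."""
    return "; ".join(f"{c['name']}={c['value']}" for c in browser_cookies)
```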
Example Use Cases
- A site shows strong protection with a 35% success rate; tier 4 blocks most pages, so deep mode or partial scraping is recommended
- Encountering 403 Forbidden; switch to stealth tier (4-5) to bypass basic protections
- SSL errors resolved by switching to TLS compatibility tier (3)
- 429 rate limits trigger exponential backoff and reduced concurrency to regain throughput
- Content validation flags a redirect to login, leading to page skip and a note in results