scrape-strategy

npx machina-cli add skill blisspixel/primr/scrape-strategy --openclaw

Scrape Strategy
Primr uses an 8-tier fallback system for web scraping. See references/tiers.md for the full tier table and selection heuristics.
Key Features
- Sticky Tier: Once a tier works for a host, it's tried first for subsequent pages
- Circuit Breaker: After 3 consecutive failures of the same tier for a host, that tier is skipped
- Cookie Handoff: Cookies obtained by browser tiers are reused by faster HTTP tiers
- Content Validation: Checks actual content, not just HTTP status -- catches "200 OK" responses that are actually block pages
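The sticky-tier and circuit-breaker rules above can be sketched as per-host state. This is a minimal illustration in Python; `HostState` and its method names are hypothetical, not Primr's actual API:

```python
# Hypothetical per-host tier state: a sticky tier that is tried first,
# plus a circuit breaker that skips a tier after 3 consecutive failures.
from collections import defaultdict

FAILURE_LIMIT = 3  # circuit-breaker threshold from the feature list above

class HostState:
    def __init__(self, num_tiers=8):
        self.num_tiers = num_tiers
        self.sticky_tier = 0                  # last tier that worked for this host
        self.failures = defaultdict(int)      # tier -> consecutive failure count

    def record_result(self, tier, ok):
        if ok:
            self.sticky_tier = tier           # reuse this tier for later pages
            self.failures[tier] = 0           # success resets the breaker
        else:
            self.failures[tier] += 1

    def tier_order(self):
        """Sticky tier first, then the rest, skipping tripped circuits."""
        order = [self.sticky_tier] + [t for t in range(self.num_tiers)
                                      if t != self.sticky_tier]
        return [t for t in order if self.failures[t] < FAILURE_LIMIT]
```

A success on tier 2 makes tier 2 first in line for the host; three straight failures on it remove it from the rotation until a success elsewhere.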
Error Handling
Content Validation Indicators
- Content length < 1000 bytes
- Contains "access denied", "blocked", "captcha"
- Missing expected content markers
- Redirect to login/challenge page
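The four indicators above can be combined into a single check. A sketch, with the function name and marker handling as illustrative assumptions (the 1000-byte minimum and block phrases mirror the list):

```python
# Sketch of content validation: flag "200 OK" responses that are really
# block pages, using the indicators listed above.
BLOCK_PHRASES = ("access denied", "blocked", "captcha")

def looks_blocked(body: str, final_url: str, expected_markers=()) -> bool:
    text = body.lower()
    if len(body.encode("utf-8")) < 1000:          # suspiciously short page
        return True
    if any(phrase in text for phrase in BLOCK_PHRASES):
        return True
    if expected_markers and not any(m.lower() in text for m in expected_markers):
        return True                               # missing expected content
    if any(p in final_url for p in ("/login", "/challenge")):
        return True                               # redirected to a gate page
    return False
```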
Tier Escalation
On failure: log reason, check circuit breaker, try next tier. Stops after 3 consecutive same-error failures.
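The escalation rule can be expressed as a short loop. `fetch_with_tier` is a hypothetical callable standing in for a tier's fetch logic; the stop-after-3-identical-errors behavior matches the rule above:

```python
# Minimal escalation loop: log the failure reason, move to the next tier,
# and stop after 3 consecutive failures with the same error.
def scrape_with_escalation(tiers, fetch_with_tier, max_same_error=3):
    last_error, same_error_count = None, 0
    for tier in tiers:                 # tiers already filtered by the breaker
        try:
            return fetch_with_tier(tier)
        except Exception as exc:
            reason = type(exc).__name__
            print(f"tier {tier} failed: {reason}")     # log the reason
            same_error_count = same_error_count + 1 if reason == last_error else 1
            last_error = reason
            if same_error_count >= max_same_error:
                break                  # same error 3x: escalating won't help
    return None
```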
Recovery Strategies
| Failure Type | Strategy |
|---|---|
| Timeout | Increase timeout, try slower tier |
| 403 Forbidden | Try stealth tier (4-5) |
| 429 Rate Limit | Exponential backoff, reduce concurrency |
| SSL Error | Try TLS compatibility tier (3) |
| Empty Content | Try aggressive tier (2) |
| CAPTCHA | Skip page, note in results |
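The 429 row above is the one that benefits from a concrete formula. A common sketch is exponential backoff with jitter; the base, cap, and jitter range here are illustrative defaults, not Primr's:

```python
# Illustrative backoff for 429 rate limits: exponential delay, capped,
# with 50-100% jitter so concurrent workers don't retry in lockstep.
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry N: min(cap, base * 2^N) scaled by jitter."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```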
Interpreting Results
34/46 pages scraped
34 = successfully scraped, 46 = total selected, 12 = failed
- 70%+ success rate: Good coverage
- 50-70%: Acceptable for protected sites
- <50%: Consider deep mode instead
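The thresholds above map directly to a verdict function. A minimal sketch (the name `coverage_verdict` and return strings are illustrative):

```python
# Apply the success-rate thresholds above to a scrape summary.
def coverage_verdict(scraped: int, total: int) -> str:
    rate = scraped / total if total else 0.0
    if rate >= 0.70:
        return "good coverage"
    if rate >= 0.50:
        return "acceptable for protected sites"
    return "consider deep mode instead"
```

For the sample result above, 34/46 is about 74%, which lands in the "good coverage" band.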
Example Workflow
User: "The site seems heavily protected"
1. Check prior scrape results:
- Success rate: 35%
- Most pages blocked at tier 4
2. Recommend strategy:
"This site has strong protection. I recommend:
- Use deep mode for external research
- Or accept partial scrape results
Deep mode gathers information from external sources
without needing to access the protected site directly."
3. If user chooses deep mode:
estimate_run(company, url, "deep")
-> Cost: $0.80, Time: ~12 minutes
Constraints
- Patient Timeout: 90s max per page (allows multiple tier attempts)
- Concurrency: 3 concurrent pages default
- Circuit Breaker: 3 failures before tier skip
- Smart Escalation: Stops after 3 consecutive same-error failures
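The timeout and concurrency constraints above can be sketched with `asyncio`. `fetch_page` is a hypothetical coroutine; the 90-second budget and concurrency of 3 come from the constraint list:

```python
# Sketch of the constraints above: 3 concurrent pages, 90s budget per page.
import asyncio

PAGE_TIMEOUT = 90   # seconds per page, allows multiple tier attempts
CONCURRENCY = 3     # concurrent pages (default)

async def scrape_all(urls, fetch_page):
    sem = asyncio.Semaphore(CONCURRENCY)

    async def one(url):
        async with sem:
            try:
                return await asyncio.wait_for(fetch_page(url), PAGE_TIMEOUT)
            except asyncio.TimeoutError:
                return None            # counted as a failed page

    return await asyncio.gather(*(one(u) for u in urls))
```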
Source
https://github.com/blisspixel/primr/blob/main/skills/scrape-strategy/SKILL.md

Overview
Primr uses an 8-tier fallback system to navigate web scraping challenges, with features like sticky tiers, circuit breakers, and content validation. It helps when scraping fails or when sites implement protections, and it clarifies how tiers behave during escalation.
How This Skill Works
When scraping a host, Primr starts with a working tier (sticky tier) and, on failures, escalates to the next tier. A circuit breaker skips a failing tier after 3 consecutive errors, and cookies from browser tiers are reused by faster HTTP tiers (cookie handoff). Content validation checks the actual content rather than just HTTP status to detect blocks like CAPTCHA or redirects.
When to Use It
- Scraping fails due to host errors or timeouts
- Site protection blocks access (CAPTCHA, 403, 429, redirects)
- Questions about how tiers are selected and escalated
- Need to tune timeout, concurrency, or tier usage for a site
- Troubleshoot a specific host using tier results and recovery strategies
Quick Start
- Step 1: Enable scrape-strategy and begin with the sticky tier for the host
- Step 2: If a page fails, log the reason and let the circuit breaker guide tier escalation
- Step 3: Apply recovery actions (timeout increase, backoff, and tier changes) based on error type
Best Practices
- Leverage the Sticky Tier: once a tier works for a host, reuse it for subsequent pages
- Apply the Circuit Breaker: skip a tier after 3 consecutive failures for the same host
- Use Cookie Handoff: reuse cookies from browser tiers in faster HTTP tiers
- Implement Content Validation: verify content length and markers, not just 200 OK
- Follow Recovery Strategies: adjust timeout, backoff on 429, and escalate tiers as needed
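The cookie-handoff practice above amounts to replaying browser-captured cookies in a plain HTTP tier. A minimal sketch, assuming the browser tier exports cookies as a list of name/value dicts (the shape is an assumption, not Primr's format):

```python
# Build a Cookie request header for an HTTP tier from cookies captured
# by a browser tier, so the faster tier inherits the session.
def cookie_header(browser_cookies):
    """browser_cookies: list of {'name': ..., 'value': ...} dicts."""
    return "; ".join(f"{c['name']}={c['value']}" for c in browser_cookies)
```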
Example Use Cases
- A site shows strong protection with a 35% success rate; tier 4 blocks most pages, so deep mode or partial scraping is recommended
- Encountering 403 Forbidden; switch to stealth tier (4-5) to bypass basic protections
- SSL errors resolved by switching to TLS compatibility tier (3)
- 429 rate limits trigger exponential backoff and reduced concurrency to regain throughput
- Content validation flags a redirect to login, leading to page skip and a note in results