web-search

npx machina-cli add skill buildoak/fieldwork-skills/web-search --openclaw
Files (1)
SKILL.md
12.2 KB

Web Search

Web search, scraping, and content extraction for AI coding agents. Zero API keys required. Five tools organized in fallback chains: WebSearch and Crawl4AI as primary, Jina as secondary, duckduckgo-search and WebFetch as fallbacks. Use when your agent needs web information -- finding pages, extracting content, or conducting research.

Terminology used in this file:

  • Playwright: Browser automation framework used by Crawl4AI for JavaScript-rendered pages.
  • SPA: Single-page application; content is rendered dynamically in JavaScript.
  • MCP: Model Context Protocol, a standard for exposing tool servers to AI agents.

Setup

python3 -m pip install crawl4ai duckduckgo-search
crawl4ai-setup
  • Claude Code: copy this skill folder into .claude/skills/web-search/
  • Codex CLI: append this SKILL.md content to your project's root AGENTS.md

For the full installation walkthrough (prerequisites, verification, troubleshooting), see references/installation-guide.md.

Staying Updated

This skill ships with an UPDATES.md changelog and UPDATE-GUIDE.md for your AI agent.

After installing, tell your agent: "Check UPDATES.md in the web-search skill for any new features or changes."

When updating, tell your agent: "Read UPDATE-GUIDE.md and apply the latest changes from UPDATES.md."

Follow UPDATE-GUIDE.md so customized local files are diffed before any overwrite.


Quick Start

Run this minimal fallback-safe sequence:

# 1) Find candidate pages
python3 -c "from duckduckgo_search import DDGS; import json; print(json.dumps(DDGS().text('your query', max_results=5), indent=2))"

# 2) Extract one page quickly (no local deps)
curl -s "https://r.jina.ai/http://example.com/article" | head -80

# 3) Escalate to Crawl4AI if JS rendering is needed
crwl https://example.com/app -o markdown --bypass-cache

Use this routing rule: search with WebSearch first, extract with Jina/WebFetch for simple pages, escalate to Crawl4AI for JS-heavy targets.

Decision Tree

Need info from the web?
  |
  +-- Need to SEARCH for pages/answers?
  |     +-- Default first choice --> WebSearch (built-in, zero setup)
  |     +-- WebSearch unavailable? --> Jina s.jina.ai (no key needed)
  |     +-- Both fail? --> duckduckgo-search Python lib (emergency fallback)
  |
  +-- Need to EXTRACT content from a known URL?
  |     +-- JS-heavy SPA, dynamic content? --> Crawl4AI crwl (full browser rendering)
  |     +-- Simple text page (article, docs, blog)? --> Jina r.jina.ai (fast, no install)
  |     +-- Jina/Crawl4AI unavailable? --> WebFetch (built-in, AI-summarized)
  |     +-- Need structured data extraction? --> Crawl4AI with extraction strategy
  |     +-- Multiple URLs in batch? --> Crawl4AI batch mode
  |
  +-- Need DEEP RESEARCH (search + extract + combine)?
        --> WebSearch to find URLs --> Crawl4AI/Jina extract each --> synthesize

Rule of thumb: WebSearch for finding, Jina for reading, Crawl4AI for rendering.
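The rule of thumb above can be sketched as a small router. This is an illustrative helper, not part of any tool's API; the task labels, the needs_js flag, and the available set are assumptions.

```python
def route(task, needs_js=False,
          available=frozenset({"websearch", "jina", "ddgs", "crawl4ai", "webfetch"})):
    """Pick a tool name following the decision tree: WebSearch for finding,
    Jina for reading, Crawl4AI for rendering, with fallbacks in order."""
    if task == "search":
        for tool in ("websearch", "jina", "ddgs"):
            if tool in available:
                return tool
    elif task == "extract":
        if needs_js and "crawl4ai" in available:
            return "crawl4ai"
        for tool in ("jina", "webfetch"):
            if tool in available:
                return tool
    raise ValueError(f"no tool available for task {task!r}")
```

For example, `route("extract", needs_js=True)` returns `"crawl4ai"`, while `route("search", available={"jina", "ddgs"})` falls back to `"jina"`.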

Tool Reference

WebSearch (Built-in) -- Primary Search

What: Claude Code built-in web search tool. Returns search results with links and snippets.
Install required: None (built-in to Claude Code)
Strengths: Zero setup, zero API keys, integrated into agent workflow, always available
Weaknesses: No direct SDK/CLI access (tool-only); results are search-result blocks, not raw JSON

# Invoked as a Claude Code tool:
WebSearch(query="your search query")

# Supports domain filtering:
WebSearch(query="your query", allowed_domains=["docs.python.org"])
WebSearch(query="your query", blocked_domains=["pinterest.com"])

Returns: Search result blocks with titles, URLs, and content snippets.

WebFetch (Built-in) -- Fallback URL Extraction

What: Claude Code built-in URL fetcher. Fetches page content, converts HTML to markdown, processes with AI.
Install required: None (built-in to Claude Code)
Strengths: Zero setup, AI-processed output, handles redirects, 15-min cache
Weaknesses: Cannot handle authenticated/private URLs, may summarize large content

# Invoked as a Claude Code tool:
WebFetch(url="https://example.com/page", prompt="Extract the main content")

Limitations:

  • Will fail for authenticated URLs (Google Docs, Confluence, Jira, private GitHub)
  • HTTP auto-upgraded to HTTPS
  • Large content may be summarized rather than returned in full
  • When redirected to a different host, returns redirect URL instead of content

Crawl4AI -- JS-Rendering Web Scraper

What: Open-source scraper with full Playwright browser rendering. Outputs LLM-friendly markdown.
Install required: pip install crawl4ai && crawl4ai-setup
Strengths: Full JS rendering, handles SPAs, batch crawling, structured extraction
Weaknesses: Requires Playwright install, heavier than Jina

# CLI (simplest)
crwl https://example.com
crwl https://example.com -o markdown

# Python API
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url='https://example.com')
        print(result.markdown)

asyncio.run(main())

Jina Reader/Search -- Zero-Install Extraction & Search

What: URL-to-markdown converter and search via HTTP API. No install needed -- just curl.
API key: Not required. JINA_API_KEY is optional and only increases rate limits.
Strengths: Zero install, fast (~1s), works everywhere curl works, search + extract in one service
Weaknesses: No JS rendering, rate limited without API key

# Read a URL (returns markdown)
curl -s 'https://r.jina.ai/https://example.com'

# Search (returns search results)
curl -s 'https://s.jina.ai/your+search+query'

# With API key (higher rate limits, optional)
curl -s -H "Authorization: Bearer $JINA_API_KEY" 'https://r.jina.ai/https://example.com'
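The same requests can be built from Python's standard library. This is a minimal sketch: the helper name is hypothetical, and it mirrors the curl calls above, attaching the Authorization header only when the optional JINA_API_KEY is set.

```python
import os
import urllib.request

def jina_request(target, mode="read"):
    """Build a urllib Request for Jina Reader (r.jina.ai) or Search
    (s.jina.ai). The API key is optional and only raises rate limits."""
    base = "https://r.jina.ai/" if mode == "read" else "https://s.jina.ai/"
    req = urllib.request.Request(base + target)
    key = os.environ.get("JINA_API_KEY")
    if key:
        req.add_header("Authorization", f"Bearer {key}")
    return req
```

Pass the result to `urllib.request.urlopen(...)` to fetch the markdown; for search, URL-encode the query (e.g. `your+search+query`) as in the curl example.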

duckduckgo-search -- Emergency Search Fallback

What: Python library for DuckDuckGo search. Zero API keys, zero registration.
Install required: pip install duckduckgo-search
Strengths: Completely free, no API key, no rate limit concerns, reliable fallback
Weaknesses: Less AI-optimized results than WebSearch, Python-only

from duckduckgo_search import DDGS
results = DDGS().text("your query", max_results=5)
for r in results:
    print(r['title'], r['href'], r['body'])
# One-liner from CLI
python3 -c "from duckduckgo_search import DDGS; import json; print(json.dumps(DDGS().text('your query', max_results=5), indent=2))"

Core Workflows

Pattern 1: Quick Web Search

When: You need factual answers or to find relevant pages

  1. Use WebSearch: WebSearch(query="your query here")
  2. Parse results: each result has title, URL, and content snippet
  3. Fallback: curl -s 'https://s.jina.ai/your+query+here'
  4. Emergency: python3 -c "from duckduckgo_search import DDGS; ..."

Pattern 2: URL Content Extraction

When: Have a URL, need its content as clean text/markdown

a) JS-heavy site: crwl URL (Crawl4AI, full rendering)
b) Lightweight static page: curl -s 'https://r.jina.ai/URL' (Jina)
c) Both fail: WebFetch(url="URL", prompt="Extract the main content")

Decision: Is the page an SPA or otherwise JS-heavy? Use Crawl4AI. Static content? Try Jina first. If the output is empty or broken, escalate.
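The "empty/broken output" check can be sketched as a heuristic. The threshold and marker strings below are illustrative assumptions, not behavior of any tool:

```python
def needs_rendering(extracted, min_chars=200):
    """Heuristic: treat near-empty output, or output that is mostly
    JS-required boilerplate, as a sign the page is JS-rendered and the
    extraction should be escalated to Crawl4AI."""
    text = extracted.strip()
    if len(text) < min_chars:
        return True
    markers = ("enable javascript", "loading...", "noscript")
    return any(m in text.lower() for m in markers)
```

For example, run Jina first and escalate to `crwl URL` only when `needs_rendering(output)` is true.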

Pattern 3: Deep Research

When: Need comprehensive research on a topic with multiple sources

  1. WebSearch to find relevant pages
  2. Pick top 3-5 URLs from results
  3. Extract each with Crawl4AI or Jina
  4. If any extraction fails (JS site), use the other tool
  5. Synthesize extracted content into research summary

Token budget: ~5K tokens per extracted page, so budget ~25K total for 5 pages
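The per-page budget can be enforced with a simple trim. This sketch uses the rough ~4-characters-per-token rule of thumb; the exact ratio varies by tokenizer, so treat it as an approximation:

```python
def trim_to_budget(pages, per_page_tokens=5000, chars_per_token=4):
    """Clip each extracted page to roughly per_page_tokens tokens.
    pages maps URL -> extracted text."""
    limit = per_page_tokens * chars_per_token
    return {url: text[:limit] for url, text in pages.items()}
```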

Pattern 4: Batch URL Scraping

When: Need content from multiple URLs (5+)

import asyncio
from crawl4ai import AsyncWebCrawler
urls = ['url1', 'url2', 'url3']

async def batch():
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            print(f'--- {url} ---')
            print(result.markdown[:2000])

asyncio.run(batch())

Pattern 5: Fallback Chain

When: Primary tool fails

Search chain: WebSearch (built-in) --> Jina s.jina.ai --> duckduckgo-search

Extract chain: Crawl4AI crwl --> Jina r.jina.ai --> WebFetch (built-in)

Always try the primary tool first, escalate on failure.
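The chains above can be sketched as a generic runner. The callables here are placeholders for the real tool invocations (e.g. a Jina curl wrapper or a Crawl4AI call), which is an assumption of this sketch:

```python
def run_chain(steps):
    """Try each (name, callable) in order; return the first non-empty
    result along with the tool name that produced it. Any exception or
    empty result triggers fallback to the next tool."""
    errors = []
    for name, fn in steps:
        try:
            result = fn()
            if result:
                return name, result
        except Exception as exc:
            errors.append((name, exc))
    raise RuntimeError(f"all tools in chain failed: {errors}")
```

Usage mirrors the extract chain: `run_chain([("crawl4ai", crawl), ("jina", jina), ("webfetch", webfetch)])`, where each callable wraps one tool.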


MCP Configuration

Jina MCP (optional enhancement, not required):

{
  "jina-reader": {
    "command": "npx",
    "args": ["-y", "jina-ai-reader-mcp"]
  }
}

MCP (Model Context Protocol) is optional. Your agent can use CLI/Python/built-in tools directly.


Environment Setup

Zero API keys required. All tools work out of the box.

Optional:

  • JINA_API_KEY (get from https://jina.ai) -- increases rate limits, not required
export JINA_API_KEY='jina_...'  # optional

Install:

  • pip install crawl4ai duckduckgo-search
  • crawl4ai-setup # installs Playwright browsers

Built-in tools (WebSearch, WebFetch) require no installation.

Verify: ./scripts/search-check.sh


Anti-Patterns

| Do NOT | Do instead |
| --- | --- |
| Use Crawl4AI for simple text pages | Use Jina r.jina.ai (zero overhead) |
| Use Jina for JS-heavy SPAs | Use Crawl4AI (Jina has no JS rendering) |
| Skip the fallback chain | Always have a backup: WebSearch --> Jina --> duckduckgo-search; Crawl4AI --> Jina --> WebFetch |
| Extract full pages when you need one fact | Use WebSearch (returns relevant snippets directly) |
| Batch with Jina for 10+ URLs | Use Crawl4AI batch mode (designed for it) |
| Forget rate limits | Jina without an API key has stricter limits |
| Use WebFetch for authenticated URLs | It will fail; use the browser-ops skill or direct API access |

Error Handling

| Symptom | Tool | Cause | Fix |
| --- | --- | --- | --- |
| No results returned | WebSearch | Query too specific or topic too niche | Broaden query; try Jina s.jina.ai or duckduckgo-search |
| Redirect notification | WebFetch | URL redirects to a different host | Make a new WebFetch request with the provided redirect URL |
| Auth failure | WebFetch | Authenticated/private URL | Use the browser-ops skill or direct API access instead |
| Content summarized | WebFetch | Page content too large | Use Jina r.jina.ai or Crawl4AI for full content |
| 429 Too Many Requests | Jina | Rate limit hit | Add JINA_API_KEY header, or add a delay between requests |
| Empty/truncated output | Jina | JS-rendered content not captured | Escalate to Crawl4AI: crwl URL |
| crwl: command not found | Crawl4AI | Not installed or not on PATH | pip install crawl4ai && crawl4ai-setup |
| Playwright browser not found | Crawl4AI | crawl4ai-setup not run | Run: crawl4ai-setup |
| TimeoutError | Crawl4AI | Page too slow or blocking | Add a timeout parameter; check if the site blocks bots |
| SSL certificate error | Any | Expired or self-signed cert | Retry; for Crawl4AI add ignore_https_errors=True |
| 403 Forbidden | Jina/Crawl4AI | Site blocking automated access | Try a different tool from the fallback chain |
| ImportError: duckduckgo_search | duckduckgo-search | Package not installed | pip install duckduckgo-search |
| RatelimitException | duckduckgo-search | Too many requests too fast | Add a 1-2s delay between calls, or switch to WebSearch |
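For the rate-limit rows (Jina 429, duckduckgo-search RatelimitException), a retry-with-delay wrapper is often enough. This is a minimal sketch; the retry count and delays are illustrative choices:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying with a linearly growing delay (1s, 2s, ...) on
    any exception. The sleep parameter is injectable for testing."""
    last = None
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception as exc:
            last = exc
            if attempt < retries:
                sleep(base_delay * attempt)
    raise last
```

For persistent rate limiting, switch tools in the fallback chain instead of retrying indefinitely.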

Bundled Resources Index

| Path | What | When to load |
| --- | --- | --- |
| ./UPDATES.md | Structured changelog for AI agents | When checking for new features or updates |
| ./UPDATE-GUIDE.md | Instructions for AI agents performing updates | When updating this skill |
| ./references/installation-guide.md | Detailed install walkthrough for Claude Code and Codex CLI | First-time setup or environment repair |
| ./references/tool-comparison.md | Side-by-side comparison: latency, cost, JS support, accuracy | When choosing between tools for a specific use case |
| ./references/error-patterns.md | Detailed failure modes and recovery per tool | When debugging a failed extraction or search |
| ./scripts/search-check.sh | Health check: verifies all tools are available | Before the first web search task in a session |
| ./scripts/setup.sh | One-shot installer for all dependencies | First-time setup or after environment reset |

Source

git clone https://github.com/buildoak/fieldwork-skills
# SKILL.md lives at skills/web-search/SKILL.md in the repository

Overview

Web search, scraping, and content extraction for AI coding agents. It operates without API keys and relies on a five-tool fallback chain, with WebSearch and Crawl4AI as primary, Jina as secondary, and duckduckgo-search and WebFetch as backups. Use it when your agent needs web information—finding pages, extracting content, or conducting research.

How This Skill Works

A decision tree routes tasks: search first with WebSearch; for content extraction, use Jina or WebFetch on simple pages and escalate to Crawl4AI for JavaScript-rendered content. The skill emphasizes zero-setup tooling and a clear fallback path from primary to emergency options, and it supports batch extraction when multiple URLs are involved.

When to Use It

  • You need to locate relevant pages or answers on the web.
  • You must extract content from a known URL, such as articles, docs, or blogs.
  • You need to render and extract content from JS-heavy pages or SPAs.
  • You are conducting deep research that combines search, extraction, and synthesis across multiple sources.
  • You want a zero-setup workflow with reliable fallbacks when primary tools are unavailable.

Quick Start

  1. Find candidate pages with WebSearch(query="your search query")
  2. Quickly extract a page using Jina or WebFetch (e.g., curl -s "https://r.jina.ai/http://example.com/article" | head -80)
  3. If the page requires rendering, escalate to Crawl4AI (e.g., crwl https://example.com -o markdown --bypass-cache)

Best Practices

  • Start with WebSearch to discover pages before attempting extraction.
  • Prefer Jina for fast, simple pages that don’t require rendering.
  • Escalate to Crawl4AI only when content is JavaScript-rendered or dynamic.
  • Validate extracted content and record source URLs for traceability.
  • Use Crawl4AI batch mode when you need to process multiple URLs efficiently.

Example Use Cases

  • Research API docs or specifications across multiple sites and compile a unified summary.
  • Summarize a news article or technical blog and cite the original sources.
  • Compare product specs by aggregating data from several vendor pages.
  • Aggregate key insights from a set of related blog posts or tutorials.
  • Extract structured data (titles, URLs, snippets) from a list of search results for modeling.
