What inputs does the skill require?

The skill takes a required url and an optional options object. It returns markdown_content and metadata.

Can I preserve images and links?

Yes. Both behaviors are enabled by default: include_images keeps images and preserve_links retains links in the Markdown output.

What metadata is returned?

Metadata includes title, author, publish_date, url, word_count, reading_time, and extraction_success.

Webpage To Markdown

npx machina-cli add skill kay-ou/ClaudeSkills/webpage-to-markdown --openclaw

Files (1): SKILL.md (4.5 KB)

name: webpage-to-markdown
description: Convert web pages to clean Markdown format, extracting main content while preserving structure. This skill should be used when users need to convert web pages to markdown, extract content from websites, save web articles as markdown, convert HTML to markdown, archive web content, or process online documentation. Keywords: 网页转markdown, 网页转换, HTML转markdown, 提取网页内容, 保存网页为markdown, convert webpage to markdown, extract web content, save article as markdown, HTML to markdown conversion, web scraping to markdown, 网页内容提取, 在线文档转换
inputs:
  - url
  - options (optional)
outputs:
  - markdown_content
  - metadata
instructions: |

Webpage to Markdown Converter

This skill fetches web pages and converts them to clean, readable Markdown format. It uses readability algorithms to extract the main content while preserving the document structure and formatting.

When to Use This Skill

Use this skill when you need to:

Convert web articles or documentation to Markdown
Extract readable content from web pages
Process web content for analysis or transformation
Archive web content in a readable format
Prepare web content for further processing

Processing Steps

Step 1: Fetch Web Page Content

Validate the provided URL format
Send HTTP GET request with appropriate headers
Handle redirects and follow them safely
Manage different character encodings
Handle network errors and timeouts gracefully
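
The fetch step above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the skill's actual implementation. The function names is_valid_url and fetch_html, the 10-second timeout, and the User-Agent string are all assumptions made for illustration.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and decode it with the charset the server declares."""
    if not is_valid_url(url):
        raise ValueError(f"Invalid URL: {url!r}")
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "webpage-to-markdown-sketch/0.1"})
    # urlopen follows HTTP redirects by default and raises URLError
    # (or a timeout error) when the network request fails.
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the Content-Type header names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Decoding with errors="replace" keeps the conversion going when a page contains a few bytes that do not match its declared encoding, at the cost of replacement characters in the output.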

Step 2: Parse HTML Structure

Parse HTML using a robust parser
Identify the main content area using readability heuristics
Remove navigation, ads, sidebars, and other non-content elements
Preserve the document's logical structure
Handle malformed HTML gracefully
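
As a rough illustration of removing non-content elements, the sketch below drops text nested inside a fixed set of tags. It is a simplification: the tag list in SKIPPED_TAGS and the names ContentTextExtractor and extract_content_text are assumptions, and a real readability heuristic would also score elements rather than rely on tag names alone.

```python
from html.parser import HTMLParser

# Tags treated as non-content in this sketch (an assumed list).
SKIPPED_TAGS = {"nav", "aside", "footer", "header", "script", "style", "form"}


class ContentTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # how many skipped tags are currently open
        self.chunks = []     # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_content_text(html: str) -> list[str]:
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

html.parser tolerates most malformed markup without raising, but an unclosed skipped tag would cause this sketch to discard everything after it, so a production version needs a more forgiving strategy.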

Step 3: Extract Main Content

Use readability algorithms to identify primary content
Preserve headings and their hierarchy (H1-H6)
Maintain paragraph structure and flow
Keep important formatting elements
Extract images with their alt text and sources
Preserve links with proper anchor text
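
Collecting images and links with their alt text and anchor text might be sketched as follows. The class name MediaCollector is hypothetical, and resolving relative URLs against the page URL is an assumption about desirable behavior rather than something the skill documents.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaCollector(HTMLParser):
    """Collect (alt, src) pairs for images and (text, href) pairs for links."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.images = []   # (alt_text, absolute_src)
        self.links = []    # (anchor_text, absolute_href)
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            src = urljoin(self.base_url, attributes["src"])
            self.images.append((attributes.get("alt") or "", src))
        elif tag == "a" and attributes.get("href"):
            self._href = urljoin(self.base_url, attributes["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse internal whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((anchor, self._href))
            self._href = None
```

A caller would construct MediaCollector with the page URL, call feed() with the HTML, and then read the images and links attributes.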

Step 4: Convert to Markdown

Convert headings to appropriate Markdown levels
Transform paragraphs to Markdown format
Convert lists (ordered and unordered) to Markdown syntax
Handle code blocks with language detection
Convert tables to Markdown table format
Preserve emphasis (bold, italic) formatting
Handle blockquotes appropriately
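
To make the conversion step concrete, here is a minimal sketch that handles only headings, paragraphs, flat lists, and emphasis. Code blocks, tables, blockquotes, nested lists, and block elements inside list items are deliberately left out. The names MarkdownConverter and html_to_markdown are hypothetical, and the approach (a standard-library HTMLParser subclass) is one possible technique, not necessarily what the skill uses.

```python
from html.parser import HTMLParser

HEADINGS = {f"h{level}": "#" * level + " " for level in range(1, 7)}
INLINE_MARKERS = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class MarkdownConverter(HTMLParser):
    """Convert headings, paragraphs, flat lists, and emphasis to Markdown."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks = []      # finished (markdown_text, list_id) pairs
        self.current = []     # text pieces of the block being built
        self.prefix = ""      # "## ", "- ", "1. ", ... for the open block
        self.list_id = None   # id of the list the open block belongs to
        self.open_lists = []  # one [tag, item_count, list_id] per open list
        self.lists_seen = 0

    def _flush(self) -> None:
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append((self.prefix + text, self.list_id))
        self.current, self.prefix, self.list_id = [], "", None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._flush()
            self.prefix = HEADINGS[tag]
        elif tag == "p":
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.lists_seen += 1
            self.open_lists.append([tag, 0, self.lists_seen])
        elif tag == "li" and self.open_lists:
            self._flush()
            current_list = self.open_lists[-1]
            current_list[1] += 1
            ordered = current_list[0] == "ol"
            self.prefix = f"{current_list[1]}. " if ordered else "- "
            self.list_id = current_list[2]
        elif tag in INLINE_MARKERS:
            self.current.append(INLINE_MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag in ("p", "li"):
            self._flush()
        elif tag in ("ul", "ol") and self.open_lists:
            self._flush()
            self.open_lists.pop()
        elif tag in INLINE_MARKERS:
            self.current.append(INLINE_MARKERS[tag])

    def handle_data(self, data):
        self.current.append(data)


def html_to_markdown(html: str) -> str:
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    converter._flush()  # emit any trailing text outside a block element
    lines = []
    previous_list = None
    for text, list_id in converter.blocks:
        same_list = list_id is not None and list_id == previous_list
        if lines and not same_list:
            lines.append("")  # blank line between blocks, none inside a list
        lines.append(text)
        previous_list = list_id
    return "\n".join(lines)
```

Items of the same list are joined with single newlines so the result is a tight list, while every other block is separated by a blank line.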

Step 5: Generate Clean Output

Remove excessive whitespace and blank lines
Ensure consistent formatting throughout
Validate Markdown syntax
Optimize for readability
Preserve important semantic information
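
The whitespace cleanup in this step can be approximated with two regular expressions. The sketch below is illustrative only, and the function name clean_markdown is an assumption. Note that it also strips the two trailing spaces Markdown uses for hard line breaks and does not treat fenced code blocks specially, both of which a careful implementation would need to handle.

```python
import re


def clean_markdown(markdown: str) -> str:
    """Normalize whitespace without touching the text itself."""
    # Strip trailing spaces and tabs from every line.
    text = re.sub(r"[ \t]+$", "", markdown, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the document with exactly one newline.
    return text.strip() + "\n"
```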

Optional Parameters

The options parameter can include:

include_images: Whether to include images (default: true)
preserve_links: Whether to keep all links (default: true)
extract_metadata: Whether to extract page metadata (default: true)
content_selector: CSS selector for specific content extraction
exclude_selectors: CSS selectors for elements to exclude
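
One way to model these options in code is a small dataclass carrying the documented defaults. This is a sketch under assumptions: the class name ConversionOptions and the from_dict helper are hypothetical, and the defaults for content_selector (None) and exclude_selectors (empty list) are guesses, since the skill does not state them.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ConversionOptions:
    include_images: bool = True    # documented default
    preserve_links: bool = True    # documented default
    extract_metadata: bool = True  # documented default
    content_selector: Optional[str] = None                      # assumed default
    exclude_selectors: list[str] = field(default_factory=list)  # assumed default

    @classmethod
    def from_dict(cls, raw: dict) -> "ConversionOptions":
        """Build options from a plain dict, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"Unknown options: {sorted(unknown)}")
        return cls(**raw)
```

Rejecting unknown keys surfaces typos such as include_image early instead of silently falling back to the default.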

Output Format

Returns an object containing:

{
  "markdown_content": "# Article Title\n\nArticle content in Markdown format...",
  "metadata": {
    "title": "Original Page Title",
    "author": "Article Author",
    "publish_date": "2024-01-15",
    "url": "https://example.com/article",
    "word_count": 1250,
    "reading_time": 5,
    "extraction_success": true
  }
}
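
The skill does not say how word_count and reading_time are computed. The sketch below shows one plausible derivation, assuming whitespace-separated word counting and a reading speed of 250 words per minute (which happens to match the 1250-word, 5-minute example above). The function name build_metadata and the rule that extraction succeeds when at least one word was found are also assumptions.

```python
import math

WORDS_PER_MINUTE = 250  # assumed reading speed; not specified by the skill


def build_metadata(markdown_content: str, url: str, title: str = "") -> dict:
    """Derive a subset of the documented metadata fields from the output."""
    # Whitespace splitting also counts Markdown markers such as "#" as words.
    word_count = len(markdown_content.split())
    return {
        "title": title,
        "url": url,
        "word_count": word_count,
        "reading_time": math.ceil(word_count / WORDS_PER_MINUTE),
        "extraction_success": word_count > 0,  # assumed success criterion
    }
```

Fields such as author and publish_date are omitted here because they would have to come from the page itself rather than from the converted text.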

Content Preservation

The converter preserves:

Headings: Maintains heading hierarchy (H1 → #, H2 → ##, etc.)
Paragraphs: Clean paragraph breaks and spacing
Lists: Both ordered (1., 2., 3.) and unordered (-, *, +) lists
Code Blocks: With language detection and proper fencing
Links: Internal and external links with descriptive text
Images: With alt text and source URLs
Tables: Converted to Markdown table syntax
Emphasis: Bold (**text**) and italic (*text*) formatting
Blockquotes: Preserved with > syntax
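
Table conversion is the least obvious item in the list above, so here is a minimal sketch of it. It assumes the table has already been extracted into rows of cell strings with the header first, and it emits pipe-table syntax, which is a GitHub Flavored Markdown extension rather than part of core Markdown. The function name rows_to_markdown_table is hypothetical.

```python
def rows_to_markdown_table(rows: list[list[str]]) -> str:
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(cells: list[str]) -> str:
        # Pad short rows so every row has the same number of columns.
        padded = list(cells) + [""] * (width - len(cells))
        # Escape pipes so cell text cannot break the column layout.
        escaped = [cell.replace("|", "\\|").strip() for cell in padded]
        return "| " + " | ".join(escaped) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator, *(render(row) for row in body)])
```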

Error Handling

Invalid URLs: Return appropriate error messages
Network failures: Handle timeouts and connection issues
Malformed HTML: Gracefully handle poorly formatted pages
Missing content: Return meaningful error if no content found
Encoding issues: Handle various character encodings
Large pages: Manage memory efficiently for large documents
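
The error cases above could be mapped onto the documented output shape along these lines. This is a sketch, not the skill's behavior: the error field in metadata, the message wording, and the function names error_result and convert_safely are all hypothetical, and convert stands in for any callable that takes a URL and returns Markdown.

```python
from urllib.error import URLError


def error_result(url: str, message: str) -> dict:
    """Build a failed result in the documented output shape."""
    return {
        "markdown_content": "",
        # "error" is a hypothetical field added for this sketch.
        "metadata": {"url": url, "extraction_success": False, "error": message},
    }


def convert_safely(url: str, convert) -> dict:
    """Run convert(url) and translate known failures into error results."""
    try:
        markdown = convert(url)
    except UnicodeDecodeError as exc:  # must precede ValueError, its parent class
        return error_result(url, f"Encoding issue: {exc}")
    except ValueError as exc:  # assumes convert raises ValueError for bad URLs
        return error_result(url, f"Invalid URL: {exc}")
    except (URLError, TimeoutError) as exc:
        return error_result(url, f"Network failure: {exc}")
    if not markdown.strip():
        return error_result(url, "No content found")
    return {
        "markdown_content": markdown,
        "metadata": {"url": url, "extraction_success": True},
    }
```

Returning a result with extraction_success set to false, rather than raising, lets callers handle every outcome through the same output structure.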

Source

git clone https://github.com/kay-ou/ClaudeSkills

The skill file is at .claude/skills/webpage-to-markdown/SKILL.md on the main branch.

Overview

Webpage To Markdown fetches a URL and converts the page into readable Markdown. It uses readability heuristics to extract the main content while preserving headings, lists, images with alt text, links, and formatting. The tool is ideal for offline reading, archiving, or feeding content into other workflows.

How This Skill Works

The skill retrieves the web page content, handles redirects and encodings, and parses the HTML to identify the primary content. It then converts the extracted content into Markdown, preserving document structure (headings, lists, code blocks, tables) along with images and links. Optional parameters control whether images, links, and metadata are included.

When to Use It

Convert web articles or documentation to Markdown for offline viewing or CMS import
Extract readable content from complex web pages while discarding navigation and ads
Archive web content in a clean, portable Markdown format for analysis or backup
Prepare web content for processing by other tools (summaries, indexes, or transclusion)
Convert online documentation or tutorials to Markdown for offline study

Quick Start

Step 1: Provide a URL and optional parameters (include_images, preserve_links, extract_metadata)
Step 2: Run the Webpage To Markdown skill to fetch and convert
Step 3: Retrieve markdown_content and metadata from outputs and review

Best Practices

Validate the input URL and test with representative pages before wide use
Use the include_images and preserve_links options to control whether images and links appear in the output
Use content_selector and exclude_selectors to focus on the relevant content
Review the heading hierarchy (H1-H6) and adjust if needed after conversion
Verify the generated Markdown for readability and proper syntax; fix tables or code blocks if needed

Example Use Cases

Save a blog post as Markdown for a static site generator
Archive documentation pages in Markdown for offline access
Extract product guides from vendor sites for a knowledge base
Convert API reference pages into Markdown for internal tooling
Transform news articles into Markdown for research notes
