What is building-github-index?

A tool to generate Markdown indexes of GitHub repositories optimized for Claude project knowledge, with a defined priority for extracting titles and descriptions (YAML frontmatter, markdown headings, notebook headings, code symbols when enabled, then path-derived titles).

How do I combine multiple repos into one index?

Pass multiple owner/repo arguments to the script and specify an output file such as combined.md, for example python scripts/github_index.py owner/repo1 owner/repo2 -o combined.md. For code-heavy repos, add --code-symbols.

What if a repository has missing or stub descriptions?

Manual curation is recommended. Generate a tree-only skeleton with --skip-fetch (python scripts/github_index.py owner/repo --skip-fetch -o skeleton.md), then enrich the descriptions by hand using the tree structure and your domain knowledge.

building-github-index

npx machina-cli add skill oaustegard/claude-skills/building-github-index --openclaw

Building GitHub Index

Create markdown indexes of GitHub repositories optimized for Claude project knowledge. Each index pairs semantic descriptions (for relevance matching) with file paths that can be retrieved through the GitHub API.

Quick Start

# Documentation repos (markdown/notebooks)
python scripts/github_index.py owner/repo -o index.md

# Code repos (extract symbols via tree-sitter)
python scripts/github_index.py owner/repo --code-symbols -o index.md

# Multiple repos combined
python scripts/github_index.py owner/repo1 owner/repo2 -o combined.md

Script Options

| Flag | Description |
|------|-------------|
| `-o, --output` | Output file (default: `github_index.md`) |
| `--token` | GitHub PAT; also reads `GITHUB_TOKEN` env |
| `--include-patterns` | Only index matching globs: `"docs/" "src/"` |
| `--exclude-patterns` | Skip matching globs: `"test/**"` |
| `--max-files` | Cap files per repo (default: 200) |
| `--skip-fetch` | Tree only, no content fetch (fast, filename-only descriptions) |
| `--code-symbols` | Include code files, extract function/class names via tree-sitter |
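
As a rough illustration of how the include, exclude, and cap options might interact, the sketch below filters a list of paths with Python's `fnmatch`. The helper name `select_paths`, the sample paths, and the matching semantics are assumptions for illustration only; the script's own glob handling may differ (for instance, `fnmatch` lets `*` match `/`).

```python
from fnmatch import fnmatch


def select_paths(paths, include=None, exclude=None, max_files=200):
    """Hypothetical helper: keep paths matching any include glob, drop
    paths matching any exclude glob, then cap the result at max_files."""
    selected = []
    for path in paths:
        if include and not any(fnmatch(path, pat) for pat in include):
            continue
        if exclude and any(fnmatch(path, pat) for pat in exclude):
            continue
        selected.append(path)
    return selected[:max_files]


# Sample paths are made up for the example.
paths = ["docs/intro.md", "src/app.py", "test/unit/test_app.py", "README.md"]
print(select_paths(paths, include=["docs/*", "src/*"], exclude=["test/**"]))
```

Applying the cap last means it counts only files that survived the filters, which seems the more useful reading of "cap files per repo", though that ordering is also an assumption.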

Description Extraction Priority

1. YAML frontmatter - title: and description: fields
2. Markdown headings - First h1/h2 as title, subsequent as topics
3. Notebook cells - First markdown cell heading
4. Code symbols - Public function/class names (with --code-symbols)
5. Path-derived - Convert filename to words (fallback)
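
The fallback chain can be pictured with a minimal sketch like the one below. It covers only the frontmatter, markdown heading, and path-derived steps (notebook cells and tree-sitter symbols are left out), and every function name here is hypothetical rather than taken from `github_index.py`. Preferring `description` over `title` when both exist is also an assumption.

```python
import re
from pathlib import PurePosixPath


def frontmatter_fields(text):
    """Very small frontmatter reader: only handles flat `key: value` lines
    between leading `---` fences (a real YAML parser would be more robust)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing fence, so treat the text as having no frontmatter


def first_heading(text):
    """Return the first h1/h2 markdown heading, or None."""
    match = re.search(r"^#{1,2}[ \t]+(.+?)[ \t]*$", text, flags=re.MULTILINE)
    return match.group(1) if match else None


def path_title(path):
    """Fallback: turn `docs/getting_started.md` into `getting started`."""
    stem = PurePosixPath(path).stem
    return re.sub(r"[-_]+", " ", stem).strip()


def describe(path, text):
    """Try each source in priority order; return the first non-empty result."""
    fields = frontmatter_fields(text)
    return (
        fields.get("description")
        or fields.get("title")
        or first_heading(text)
        or path_title(path)
    )
```

Because each step returns an empty value when it finds nothing, the `or` chain falls through to the path-derived title only when no richer source is available.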

When Descriptions Fail

Some repos have stub files (links to external docs, empty readmes). In these cases, manual curation is recommended. Use the tree output and domain knowledge:

# Get tree structure only (fast)
python scripts/github_index.py owner/repo --skip-fetch -o skeleton.md
# Then manually enhance descriptions based on domain knowledge

For code-heavy repos with embedded apps:

Directory names encode purpose: acc_wav_gen → "ACC waveform generation"
Peripheral acronyms map to functions: AFEC=ADC, MCAN=CAN, TWIHS=I2C
Operation modes: blocking, interrupt, dma, polled
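
When curating by hand, these conventions can be applied semi-mechanically. The sketch below is one hypothetical way to annotate directory names; the helper `describe_dir` and the sample directory names are invented for illustration, while the acronym and mode lists come from the notes above.

```python
# Acronym mappings and modes taken from the notes above; extend as needed.
PERIPHERALS = {"afec": "ADC", "mcan": "CAN", "twihs": "I2C"}
MODES = {"blocking", "interrupt", "dma", "polled"}


def describe_dir(name):
    """Hypothetical helper: annotate a directory name such as `afec_dma`
    with the peripheral function and operation mode it encodes."""
    words = []
    for token in name.lower().split("_"):
        if token in PERIPHERALS:
            words.append(f"{token.upper()} ({PERIPHERALS[token]})")
        elif token in MODES:
            words.append(f"{token} mode")
        else:
            words.append(token)
    return " ".join(words)


print(describe_dir("afec_dma"))                # AFEC (ADC) dma mode
print(describe_dir("twihs_master_interrupt"))  # TWIHS (I2C) master interrupt mode
```

The output is a starting point for a description, not a finished one; names like `acc_wav_gen` still need a human to expand abbreviations that are not in the table.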

Output Format

# {Repo} - Content Index

**Repository:** {url}
**Branch:** `{branch}`

## Retrieval Method
{API curl commands}

---

## {Category}

| Description | Path |
|-------------|------|
| {What this covers} | `{path/file.md}` |

Description column leads (relevance matching), path follows (retrieval key).
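
To make the template concrete, here is a small sketch that renders it from in-memory data. The function `render_index` and its argument shapes are assumptions for illustration, and the Retrieval Method section is omitted for brevity.

```python
def render_index(repo, url, branch, categories):
    """Render the template above. `categories` maps a category name to a
    list of (description, path) pairs. The retrieval section is omitted."""
    lines = [
        f"# {repo} - Content Index",
        "",
        f"**Repository:** {url}",
        f"**Branch:** `{branch}`",
        "",
        "---",
    ]
    for category, entries in categories.items():
        lines += ["", f"## {category}", "",
                  "| Description | Path |", "|-------------|------|"]
        for description, path in entries:
            # Escape pipes so a description cannot break the table row.
            safe = description.replace("|", "\\|")
            lines.append(f"| {safe} | `{path}` |")
    return "\n".join(lines) + "\n"


print(render_index(
    "repo", "https://github.com/owner/repo", "main",
    {"Guides": [("Installation and setup", "docs/install.md")]},
))
```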

API Access

Enumerate files:

curl -sL "https://api.github.com/repos/OWNER/REPO/git/trees/BRANCH?recursive=1"

Fetch content:

curl -s "https://api.github.com/repos/OWNER/REPO/contents/PATH?ref=BRANCH" \
  -H "Accept: application/vnd.github+json" | \
  python3 -c "import sys,json,base64; print(base64.b64decode(json.load(sys.stdin)['content']).decode())"
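
The same two calls can be made from Python with only the standard library. This is a sketch under the assumption that the responses have the usual shapes (a `tree` list with `path` and `type` fields, and a base64 `content` field); the helper names are hypothetical, and `OWNER`, `REPO`, `BRANCH`, and `PATH` remain placeholders.

```python
import base64
import json
import urllib.request

API = "https://api.github.com"


def tree_paths(tree_payload):
    """Return file paths from a git/trees response (entries of type `blob`)."""
    return [e["path"] for e in tree_payload.get("tree", []) if e.get("type") == "blob"]


def decode_content(contents_payload):
    """Decode the base64 `content` field of a contents response to text."""
    return base64.b64decode(contents_payload["content"]).decode("utf-8")


def get_json(url, token=None):
    """GET a GitHub API URL and parse the JSON body (performs a network call)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires network access; OWNER/REPO/BRANCH/PATH are placeholders):
# tree = get_json(f"{API}/repos/OWNER/REPO/git/trees/BRANCH?recursive=1")
# text = decode_content(get_json(f"{API}/repos/OWNER/REPO/contents/PATH?ref=BRANCH"))
```

Keeping the parsing helpers separate from the network call makes them easy to check against saved responses without touching the API.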

Network

Both scripts (github_index.py and the condensed pk_index.py) download a repo tarball (single HTTP request, no per-file rate limits) and then process files locally. Allowlist: api.github.com (tarball redirects via this endpoint)
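
A minimal sketch of the tarball approach follows, assuming the archive is gzip-compressed and wraps the repository in a single top-level directory. The function names are hypothetical and are not taken from the scripts; only the local processing half is shown, with the download left as a URL builder.

```python
import io
import tarfile


def tarball_url(owner, repo, ref):
    """Build the api.github.com tarball URL for a repo at a given ref."""
    return f"https://api.github.com/repos/{owner}/{repo}/tarball/{ref}"


def iter_text_files(tar_bytes, suffixes=(".md",)):
    """Yield (path, text) for matching files in a gzipped tarball, dropping
    the single top-level directory assumed to wrap the repository contents."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(suffixes):
                continue
            _, _, path = member.name.partition("/")
            data = archive.extractfile(member).read()
            yield path, data.decode("utf-8", errors="replace")
```

Reading every file from one in-memory archive is what avoids per-file API requests; the trade-off is that the whole repository is downloaded even when only a few files are indexed.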

Related Skills

accessing-github-repos - Private repos, PAT setup, tarball download
mapping-codebases - Detailed code structure (methods, imports, line numbers)

Condensed Format (pk_index.py)

For token-constrained project knowledge, use the condensed script:

python scripts/pk_index.py owner/repo -o repo_pk.md

Produces ~80% smaller output:

Single line per file: path — description
Symbols only (no signatures)
15 files max per category
No retrieval instructions section
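
The condensed layout can be approximated in a few lines. This sketch assumes entries are already grouped by category and that each category gets a `##` heading, which is a guess; `condensed_lines` is a hypothetical name, not a function from `pk_index.py`.

```python
SEPARATOR = " \u2014 "  # em dash separator used by the one-line-per-file format
MAX_PER_CATEGORY = 15


def condensed_lines(categories, limit=MAX_PER_CATEGORY):
    """Emit one line per file, keeping at most `limit` files per category.
    `categories` maps a category name to a list of (path, description) pairs."""
    lines = []
    for category, entries in categories.items():
        lines.append(f"## {category}")  # heading style is an assumption
        for path, description in entries[:limit]:
            lines.append(f"{path}{SEPARATOR}{description}")
    return lines


# Twenty made-up entries; only the first fifteen are kept.
entries = [(f"docs/{i}.md", f"Doc {i}") for i in range(20)]
print("\n".join(condensed_lines({"Docs": entries})))
```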

Ideal when adding multiple repo indexes to project knowledge.

Source

git clone https://github.com/oaustegard/claude-skills

SKILL.md: https://github.com/oaustegard/claude-skills/blob/main/building-github-index/SKILL.md

Overview

Generates markdown indexes of GitHub repositories optimized for Claude project knowledge. The indexes enable retrieval via semantic descriptions and support linking external documentation, technical blogs, or knowledge bases. They also allow combining multiple repos into a single project index for easy knowledge access.

How This Skill Works

The tool scans repository content and extracts descriptions using a priority order: YAML frontmatter (title and description), Markdown headings (first h1/h2 as title, subsequent as topics), notebook cells (first markdown heading), code symbols (public function/class names with --code-symbols), and finally path-derived titles as a fallback. Output is a markdown index (for example index.md or combined.md) that lists a description and a path for each file, so matching files can be retrieved through the GitHub API.

When to Use It

Setting up projects that reference external documentation or knowledge bases
Creating searchable indexes of technical blogs or knowledge bases
Combining multiple GitHub repos into a single index for a project
Responding when a user mentions an index, a GitHub repo, project knowledge, or a documentation reference
Preparing Claude project knowledge with index-based retrieval

Quick Start

Pick the command that matches the repository type:

Documentation repos (markdown/notebooks): python scripts/github_index.py owner/repo -o index.md
Code repos (extract symbols via tree-sitter): python scripts/github_index.py owner/repo --code-symbols -o index.md
Multiple repos combined: python scripts/github_index.py owner/repo1 owner/repo2 -o combined.md

Best Practices

Enable --code-symbols for code-heavy repos to capture public function/class names
Use --include-patterns and --exclude-patterns to focus indexing on relevant files
Prefer YAML frontmatter and Markdown headings to maximize accurate titles and descriptions
If descriptions are missing, use the skeleton workflow with --skip-fetch for manual curation
Name outputs clearly (index.md, combined.md) and document the scope of each index

Example Use Cases

Index a documentation repo: owner/repo with index.md for project knowledge
Combine two docs repos: owner1/repo1 and owner2/repo2 into combined.md
Index a code-heavy repo: owner/repo with --code-symbols to capture functions/classes
Generate a skeleton for a stub repo: owner/repo --skip-fetch -o skeleton.md
Use condensed pk_index for token-constrained knowledge: owner/repo -o repo_pk.md
