What do similarity scores mean?

Scores range from 0.0 to 1.0 and indicate how similar two texts are, with higher values meaning greater similarity. The interpretation can depend on the chosen method (cosine, Jaccard, or Levenshtein).

Which algorithms are available in this skill?

Cosine similarity (TF-IDF based), Jaccard similarity, and Levenshtein distance are available, along with a separate "tfidf" method (TF-IDF + cosine) and batch/corpus comparisons.

How do I run comparisons in batch or via the CLI?

Use batch methods like compare_to_corpus and similarity_matrix for multiple documents, and the CLI commands shown in the SKILL.md (e.g., python similarity_checker.py with --folder, --method, or --json) to tailor outputs.

content-similarity-checker

npx machina-cli add skill dkyazzentwatwa/chatgpt-skills/content-similarity-checker --openclaw

Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

Features

Cosine Similarity: TF-IDF based comparison
Jaccard Similarity: Set-based comparison
Levenshtein Distance: Edit distance for short texts
Batch Comparison: Compare multiple documents
Similarity Matrix: Pairwise comparison of all documents
Reports: Detailed similarity reports

Quick Start

from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")

CLI Usage

# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json

API Reference

SimilarityChecker Class

class SimilarityChecker:
    def __init__(self, method: str = "cosine")

    # Text comparison
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list

    # Report
    def generate_report(self, output: str) -> str

Similarity Methods

Cosine Similarity (Default)

Best for comparing documents of different lengths:

checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0

Jaccard Similarity

Good for comparing sets of words/tokens:

checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
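
For intuition, set-based Jaccard similarity can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the skill's actual implementation: the helper name jaccard_similarity, the regex tokenizer, and the handling of two empty texts are all assumptions.

```python
import re


def jaccard_similarity(text1: str, text2: str) -> float:
    """Jaccard similarity of the two texts' lowercase word sets (hypothetical helper)."""
    tokens1 = set(re.findall(r"\w+", text1.lower()))
    tokens2 = set(re.findall(r"\w+", text2.lower()))
    if not tokens1 and not tokens2:
        return 1.0  # assumption: two empty texts count as identical
    # Shared words divided by all distinct words across both texts
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)


score = jaccard_similarity(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
)
print(f"Jaccard: {score:.2f}")  # Jaccard: 0.33
```

The two sentences share 4 words (brown, fox, over, dog) out of 12 distinct words, giving 4/12. Word order and repetition do not affect the score, which is why this measure suits token-set comparisons.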

Levenshtein (Edit Distance)

Best for short texts, typo detection:

checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
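
One common way to turn an edit distance into a 0.0 to 1.0 score is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalization; the skill may normalize differently, and both function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(f"{levenshtein_similarity('kitten', 'sitting'):.2f}")  # 0.57
```

The running time grows with the product of the two string lengths, which is one reason edit distance is usually reserved for short texts.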

TF-IDF + Cosine

Advanced: considers term importance:

checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)

Batch Comparison

Compare to Corpus

checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]
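
The shape of that result (one entry per corpus item, sorted from most to least similar) can be reproduced with a short sketch. Here difflib.SequenceMatcher from the standard library stands in for the skill's scoring, so the numbers will not match the sample output above; rank_against_corpus is a hypothetical name.

```python
from difflib import SequenceMatcher


def rank_against_corpus(target: str, corpus: list[str]) -> list[dict]:
    """Score every corpus entry against the target and sort by similarity, highest first."""
    scored = [
        {
            "index": i,  # position in the original corpus, kept so results can be traced back
            # Stand-in similarity (character-based ratio), not the skill's own metric
            "similarity": SequenceMatcher(None, target.lower(), doc.lower()).ratio(),
            "text": doc,
        }
        for i, doc in enumerate(corpus)
    ]
    return sorted(scored, key=lambda item: item["similarity"], reverse=True)


ranked = rank_against_corpus(
    "machine learning basics",
    [
        "cooking pasta at home",
        "machine learning basics explained",
        "a short history of jazz",
    ],
)
for item in ranked:
    print(item["index"], f"{item['similarity']:.2f}", item["text"])
```

Keeping the original index in each entry matters because sorting discards the corpus order.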

Similarity Matrix

documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#          doc_0    doc_1    doc_2
# doc_0    1.000    0.750    0.320
# doc_1    0.750    1.000    0.410
# doc_2    0.320    0.410    1.000

Find Duplicates

documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
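
Conceptually, duplicate detection is a threshold applied to the pairwise scores. The sketch below assumes the scores are already available as a square matrix (here, the sample values from the Similarity Matrix section as nested lists) and that the comparison is inclusive (>=); neither detail is confirmed by the skill, and pairs_above_threshold is a hypothetical helper.

```python
def pairs_above_threshold(matrix: list[list[float]], threshold: float) -> list[dict]:
    """Return document pairs whose similarity meets the threshold, most similar first."""
    pairs = []
    n = len(matrix)
    for i in range(n):
        # Upper triangle only: skips self-comparisons and mirrored (j, i) pairs
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:  # assumption: inclusive comparison
                pairs.append(
                    {"doc1_index": i, "doc2_index": j, "similarity": matrix[i][j]}
                )
    return sorted(pairs, key=lambda p: p["similarity"], reverse=True)


matrix = [
    [1.000, 0.750, 0.320],
    [0.750, 1.000, 0.410],
    [0.320, 0.410, 1.000],
]
pairs = pairs_above_threshold(matrix, threshold=0.7)
print(pairs)  # [{'doc1_index': 0, 'doc2_index': 1, 'similarity': 0.75}]
```

For n documents this checks n(n-1)/2 pairs, so very large collections may need a cheaper pre-filter before pairwise scoring.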

Compare All Methods

Get similarity scores from all algorithms:

checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}
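
The "average" value in the sample output is consistent with an unweighted mean of the four method scores. That reading is inferred from the sample numbers only, not from the skill's code:

```python
from statistics import mean

# Scores copied from the sample output above
scores = {"cosine": 0.82, "jaccard": 0.65, "levenshtein": 0.71, "tfidf": 0.78}

# (0.82 + 0.65 + 0.71 + 0.78) / 4 = 2.96 / 4
average = round(mean(scores.values()), 2)
print(average)  # 0.74
```

Because the methods measure different things, the average is a rough summary rather than a calibrated score.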

Folder Operations

Compare All Files in Folder

checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}

Find Most Similar to Query

query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]

Output Format

Comparison Result

result = checker.compare_with_details(text1, text2)

# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}

Example Workflows

Plagiarism Check

checker = SimilarityChecker()

# Read the submission and rank the source files by similarity to it
with open("student_paper.txt", encoding="utf-8") as f:
    submission = f.read()
results = checker.find_most_similar(submission, "./source_materials/", top_n=10)

# Keep only the sources above the review threshold
suspicious = [r for r in results if r["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for r in suspicious:
        print(f"  {r['file']}: {r['similarity']:.0%}")

Document Deduplication

from pathlib import Path

checker = SimilarityChecker()

# Load all documents, keyed by file name
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates (indices in the result refer to positions in this list)
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)

print(f"Found {len(duplicates)} duplicate pairs")

Content Matching

checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")

Dependencies

scikit-learn>=1.3.0
nltk>=3.8.0
numpy>=1.24.0
pandas>=2.0.0

Source

https://github.com/dkyazzentwatwa/chatgpt-skills/blob/main/content-similarity-checker/SKILL.md

Overview

Content-Similarity-Checker compares documents and text using multiple algorithms, including TF-IDF-based cosine similarity and set-based Jaccard similarity, with optional Levenshtein distance for short texts. It supports batch comparisons, similarity matrices, and detailed reports to aid plagiarism detection, duplicate finding, and content matching.

How This Skill Works

The tool tokenizes text, builds TF-IDF vectors, and computes cosine similarity. It also performs Jaccard similarity on token sets and uses Levenshtein distance for short texts. Advanced features include batch comparisons, similarity matrices, and reporting, accessible via Python API or CLI.
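
The tokenize, vectorize, compare pipeline described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the skill's code: the regex tokenizer, the smoothed IDF formula, and all function names are hypothetical, and the real tool presumably relies on scikit-learn (listed under Dependencies), so its scores may differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    token_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for counts in token_counts for term in counts)
    # Assumed smoothed IDF, so terms present in every document keep a nonzero weight
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in counts.items()}
        for counts in token_counts
    ]


def cosine(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm1 * norm2)


vec1, vec2 = tfidf_vectors([
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
])
fox_score = cosine(vec1, vec2)
print(f"TF-IDF cosine: {fox_score:.2f}")
```

Because cosine similarity depends on the direction of the vectors rather than their magnitude, a short document and a long one on the same topic can still score highly.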

When to Use It

Detect plagiarism or unauthorized reuse across documents and submissions
Find near-duplicate content within a large corpus or CMS to reduce redundancy
Compare two versions of a document to quantify changes and similarity
Content matching across diverse sources for copyright checks or licensing
Batch analysis of many documents to generate a similarity matrix and reports

Quick Start

Step 1: from similarity_checker import SimilarityChecker
Step 2: checker = SimilarityChecker() # or checker = SimilarityChecker(method="jaccard")
Step 3: score = checker.compare(text1, text2) # or checker.compare_files("doc1.txt", "doc2.txt")

Best Practices

Normalize text before comparison (lowercase, remove punctuation, and stopwords if appropriate)
Choose the right method by use case: cosine for longer docs, Levenshtein for short texts, and Jaccard for token-based sets
Leverage batch comparisons and similarity matrices for large datasets to identify clusters
Review threshold-based results with human validation to avoid false positives
Use generated reports to document findings and support decisions
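
The normalization step in the first practice might look like the following sketch. The normalize helper and the tiny stopword set are illustrative assumptions; a fuller stopword list could come from nltk, which is already listed under Dependencies.

```python
import string

# Illustrative stopword set only; real lists are much longer
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is"}


def normalize(text: str, remove_stopwords: bool = False) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace, optionally drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [token for token in tokens if token not in STOPWORDS]
    return " ".join(tokens)


print(normalize("The Quick, brown fox!"))                         # the quick brown fox
print(normalize("The Quick, brown fox!", remove_stopwords=True))  # quick brown fox
```

Stopword removal is left optional because it can help with long documents but may discard most of a very short text.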

Example Use Cases

Academic institutions screening student submissions for potential plagiarism
A news site de-duplicating similar articles across multiple authors
A version control workflow comparing draft revisions to quantify similarity
A publishing house verifying content licensing by matching against external sources
An e-commerce site detecting near-duplicate product descriptions to improve SEO

Frequently Asked Questions
