
content-similarity-checker

npx machina-cli add skill dkyazzentwatwa/chatgpt-skills/content-similarity-checker --openclaw

Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

Features

  • Cosine Similarity: TF-IDF based comparison
  • Jaccard Similarity: Set-based comparison
  • Levenshtein Distance: Edit distance for short texts
  • Batch Comparison: Compare multiple documents
  • Similarity Matrix: Pairwise comparison of all documents
  • Reports: Detailed similarity reports

Quick Start

from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")

CLI Usage

# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json

API Reference

SimilarityChecker Class

class SimilarityChecker:
    def __init__(self, method: str = "cosine")

    # Text comparison
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float
    def compare_with_details(self, text1: str, text2: str) -> dict

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list

    # Report
    def generate_report(self, output: str) -> str

Similarity Methods

Cosine Similarity (Default)

Best for comparing documents of different lengths:

checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
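
To illustrate why cosine similarity tolerates different document lengths, here is a minimal standard-library sketch. It assumes raw term counts and a whitespace tokenizer; the tool's own vectorization may differ (the feature list describes it as TF-IDF based), and `cosine_similarity` is a hypothetical helper, not part of the `SimilarityChecker` API.

```python
import math
from collections import Counter


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between two term-count vectors (assumed weighting)."""
    # Assumed tokenizer: lowercase, then split on whitespace.
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Only terms present in both texts contribute to the dot product.
    shared = counts1.keys() & counts2.keys()
    dot = sum(counts1[term] * counts2[term] for term in shared)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)


short = "brown fox"
repeated = "brown fox brown fox brown fox"
# The score depends on term proportions, not on total length,
# so this prints a value that rounds to 1.0000.
print(f"{cosine_similarity(short, repeated):.4f}")
```

Because the vectors are normalized by their lengths, repeating a text does not change its direction, which is why the score stays near 1.0 here.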

Jaccard Similarity

Good for comparing sets of words/tokens:

checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
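
As a rough sketch of the underlying idea, Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The whitespace tokenizer and the `jaccard_similarity` name are assumptions made for illustration; the tool's tokenization may differ.

```python
def jaccard_similarity(text1: str, text2: str) -> float:
    """|A ∩ B| / |A ∪ B| over the sets of lowercase tokens."""
    tokens1 = set(text1.lower().split())
    tokens2 = set(text2.lower().split())
    union = tokens1 | tokens2
    if not union:
        return 0.0  # convention chosen here for two empty texts
    return len(tokens1 & tokens2) / len(union)


a = "The quick brown fox jumps over the lazy dog"
b = "A fast brown fox leaps over a sleepy dog"
# Shared tokens: brown, fox, over, dog (4 of 12 distinct tokens).
print(f"{jaccard_similarity(a, b):.2f}")  # prints 0.33
```

Word order and repetition are ignored, so two texts with the same vocabulary score 1.0 even if they are arranged differently.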

Levenshtein (Edit Distance)

Best for short texts, typo detection:

checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
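
One common way to turn an edit distance into a 0.0 to 1.0 score is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalization; the source does not say which one the tool uses, and both function names are hypothetical.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character inserts, deletes, and substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the inner row as short as possible
    previous = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        current = [i]
        for j, ch2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,                 # delete ch1
                current[j - 1] + 1,              # insert ch2
                previous[j - 1] + (ch1 != ch2),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1: str, s2: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(s1, s2) / longest


print(levenshtein_distance("kitten", "sitting"))            # prints 3
print(f"{levenshtein_similarity('kitten', 'sitting'):.2f}")  # prints 0.57
```

The dynamic-programming table costs time proportional to the product of the two lengths, which is one reason this method suits short texts.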

TF-IDF + Cosine

Weights terms by importance, so rare terms count more than common ones:

checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
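
The effect of term weighting can be sketched without any dependencies. This example assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and a whitespace tokenizer; the tool's actual weighting and the corpus it computes IDF over are not stated in the source, and `tfidf_vectors` and `cosine` are hypothetical helpers.

```python
import math
from collections import Counter


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed smoothed IDF, so no term ever gets a zero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v2[term] for term, weight in v1.items() if term in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


docs = ["the cat", "the dog", "the bird"]
vectors = tfidf_vectors(docs)
# "the" appears everywhere and is down-weighted, so sharing it counts for little.
print(f"{cosine(vectors[0], vectors[1]):.2f}")  # prints 0.26
```

With raw term counts the same pair would score 0.50, so the weighting roughly halves the credit given for sharing only a ubiquitous word.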

Batch Comparison

Compare to Corpus

checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]

Similarity Matrix

documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#          doc_0    doc_1    doc_2
# doc_0    1.000    0.750    0.320
# doc_1    0.750    1.000    0.410
# doc_2    0.320    0.410    1.000
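
A pairwise matrix like this is symmetric with ones on the diagonal, so only the upper triangle needs to be computed. The sketch below is an illustration under assumptions: it returns nested lists instead of a DataFrame to stay dependency-free, and it uses token-set Jaccard as a stand-in for whichever method the checker is configured with.

```python
def jaccard(text1: str, text2: str) -> float:
    """Stand-in similarity function: token-set overlap."""
    t1, t2 = set(text1.lower().split()), set(text2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0


def similarity_matrix(documents: list[str]) -> list[list[float]]:
    """Symmetric n x n matrix of pairwise scores."""
    n = len(documents)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # every document matches itself
        for j in range(i + 1, n):
            score = jaccard(documents[i], documents[j])
            matrix[i][j] = matrix[j][i] = score  # fill both triangles at once
    return matrix


documents = ["a b c", "a b d", "x y z"]
for label, row in zip(("doc_0", "doc_1", "doc_2"), similarity_matrix(documents)):
    print(label, [f"{value:.3f}" for value in row])
```

For n documents this performs n(n - 1)/2 comparisons, which is worth keeping in mind before running it on a large folder.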

Find Duplicates

documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
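
Threshold-based duplicate detection can be sketched as a scan over all unordered pairs. The version below is an assumption-laden illustration, not the tool's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity function, treats scores equal to the threshold as matches, and sorts results from most to least similar.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def sequence_ratio(a: str, b: str) -> float:
    """Stand-in similarity: difflib's matching-characters ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(
    documents: list[str],
    threshold: float = 0.8,
    similarity: Callable[[str, str], float] = sequence_ratio,
) -> list[dict]:
    """Return every pair of documents scoring at or above the threshold."""
    pairs = []
    # combinations() yields each unordered pair exactly once, with i < j.
    for (i, doc_i), (j, doc_j) in combinations(enumerate(documents), 2):
        score = similarity(doc_i, doc_j)
        if score >= threshold:
            pairs.append({"doc1_index": i, "doc2_index": j, "similarity": score})
    return sorted(pairs, key=lambda pair: pair["similarity"], reverse=True)


docs = ["hello world", "hello world", "completely different"]
print(find_duplicates(docs, threshold=0.8))
# prints [{'doc1_index': 0, 'doc2_index': 1, 'similarity': 1.0}]
```

Passing the similarity function as a parameter keeps the pair scan independent of the scoring method, so any of the measures described earlier could be swapped in.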

Compare All Methods

Get similarity scores from all algorithms:

checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}

Folder Operations

Compare All Files in Folder

checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}

Find Most Similar to Query

query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]

Output Format

Comparison Result

result = checker.compare_with_details(text1, text2)

# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}

Example Workflows

Plagiarism Check

from pathlib import Path

checker = SimilarityChecker()

submission = Path("student_paper.txt").read_text()

# Rank the source files by similarity to the submission itself
results = checker.find_most_similar(submission, "./source_materials/", top_n=10)

suspicious = [r for r in results if r["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for r in suspicious:
        print(f"  student_paper.txt matches {r['file']}: {r['similarity']:.0%}")

Document Deduplication

from pathlib import Path

checker = SimilarityChecker()

# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)

print(f"Found {len(duplicates)} duplicate pairs")

Content Matching

checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")

Dependencies

  • scikit-learn>=1.3.0
  • nltk>=3.8.0
  • numpy>=1.24.0
  • pandas>=2.0.0

Source

git clone https://github.com/dkyazzentwatwa/chatgpt-skills.git

SKILL.md: https://github.com/dkyazzentwatwa/chatgpt-skills/blob/main/content-similarity-checker/SKILL.md

Overview

Content-Similarity-Checker compares documents and text using multiple algorithms, including TF-IDF-based cosine similarity and set-based Jaccard similarity, with optional Levenshtein distance for short texts. It supports batch comparisons, similarity matrices, and detailed reports to aid plagiarism detection, duplicate finding, and content matching.

How This Skill Works

The tool tokenizes text, builds TF-IDF vectors, and computes cosine similarity. It also computes Jaccard similarity on token sets and a normalized Levenshtein score for short texts. Batch comparisons, similarity matrices, and reporting are available through the Python API or the CLI.

When to Use It

  • Detect plagiarism or unauthorized reuse across documents and submissions
  • Find near-duplicate content within a large corpus or CMS to reduce redundancy
  • Compare two versions of a document to quantify changes and similarity
  • Match content across diverse sources for copyright or licensing checks
  • Batch-analyze many documents to generate a similarity matrix and reports

Quick Start

  1. Import the class: from similarity_checker import SimilarityChecker
  2. Create a checker: checker = SimilarityChecker(), or SimilarityChecker(method="jaccard") to pick another algorithm
  3. Compare: score = checker.compare(text1, text2), or checker.compare_files("doc1.txt", "doc2.txt") for files

Best Practices

  • Normalize text before comparison (lowercase, remove punctuation, and stopwords if appropriate)
  • Choose the right method by use case: cosine for longer docs, Levenshtein for short texts, and Jaccard for token-based sets
  • Leverage batch comparisons and similarity matrices for large datasets to identify clusters
  • Review threshold-based results with human validation to avoid false positives
  • Use generated reports to document findings and support decisions
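
The normalization step from the first practice above can be sketched with the standard library alone. The stopword set here is a deliberately tiny, hypothetical list for illustration (a real one, such as NLTK's, is much longer), and `normalize` is not part of the `SimilarityChecker` API.

```python
import string

# Hypothetical, minimal stopword list; replace with a fuller one in practice.
STOPWORDS = frozenset({"a", "an", "the", "is", "of", "and"})

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str, remove_stopwords: bool = True) -> list[str]:
    """Lowercase, strip ASCII punctuation, tokenize, optionally drop stopwords."""
    tokens = text.lower().translate(_STRIP_PUNCTUATION).split()
    if remove_stopwords:
        tokens = [token for token in tokens if token not in STOPWORDS]
    return tokens


print(normalize("The quick, brown fox jumps over the lazy dog!"))
# prints ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

Stopword removal is left optional because it helps word-based measures but would distort a character-level comparison such as Levenshtein.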

Example Use Cases

  • Academic institutions screening student submissions for potential plagiarism
  • A news site de-duplicating similar articles across multiple authors
  • A version control workflow comparing draft revisions to quantify similarity
  • A publishing house verifying content licensing by matching against external sources
  • An e-commerce site detecting near-duplicate product descriptions to improve SEO
