content-similarity-checker
npx machina-cli add skill dkyazzentwatwa/chatgpt-skills/content-similarity-checker --openclawContent Similarity Checker
Compare documents and text for similarity using multiple algorithms.
Features
- Cosine Similarity: TF-IDF based comparison
- Jaccard Similarity: Set-based comparison
- Levenshtein Distance: Edit distance for short texts
- Batch Comparison: Compare multiple documents
- Similarity Matrix: Pairwise comparison of all documents
- Reports: Detailed similarity reports
Quick Start
from similarity_checker import SimilarityChecker
checker = SimilarityChecker()
# Compare two texts
score = checker.compare(
"The quick brown fox jumps over the lazy dog",
"A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")
# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
CLI Usage
# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"
# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt
# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv
# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard
# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7
# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
API Reference
SimilarityChecker Class
class SimilarityChecker:
def __init__(self, method: str = "cosine")
# Text comparison
def compare(self, text1: str, text2: str) -> float
def compare_files(self, file1: str, file2: str) -> float
# Multiple algorithms
def compare_all_methods(self, text1: str, text2: str) -> dict
# Batch comparison
def compare_to_corpus(self, text: str, corpus: list) -> list
def similarity_matrix(self, documents: list) -> pd.DataFrame
def find_duplicates(self, documents: list, threshold: float = 0.8) -> list
# Folder operations
def compare_folder(self, folder: str, threshold: float = None) -> dict
def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list
# Report
def generate_report(self, output: str) -> str
Similarity Methods
Cosine Similarity (Default)
Best for comparing documents of different lengths:
checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
Jaccard Similarity
Good for comparing sets of words/tokens:
checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
Levenshtein (Edit Distance)
Best for short texts, typo detection:
checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
TF-IDF + Cosine
Advanced: considers term importance:
checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
Batch Comparison
Compare to Corpus
checker = SimilarityChecker()
target = "Machine learning is a subset of artificial intelligence."
corpus = [
"AI includes machine learning and deep learning.",
"Python is a programming language.",
"Neural networks power deep learning systems."
]
results = checker.compare_to_corpus(target, corpus)
# Returns:
[
{"index": 0, "similarity": 0.65, "text": "AI includes..."},
{"index": 2, "similarity": 0.42, "text": "Neural networks..."},
{"index": 1, "similarity": 0.12, "text": "Python is..."}
]
Similarity Matrix
documents = [
"Document one content...",
"Document two content...",
"Document three content..."
]
matrix = checker.similarity_matrix(documents)
# Returns DataFrame:
# doc_0 doc_1 doc_2
# doc_0 1.000 0.750 0.320
# doc_1 0.750 1.000 0.410
# doc_2 0.320 0.410 1.000
Find Duplicates
documents = [...] # List of texts
duplicates = checker.find_duplicates(documents, threshold=0.85)
# Returns:
[
{"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
{"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
Compare All Methods
Get similarity scores from all algorithms:
checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)
# Returns:
{
"cosine": 0.82,
"jaccard": 0.65,
"levenshtein": 0.71,
"tfidf": 0.78,
"average": 0.74
}
Folder Operations
Compare All Files in Folder
checker = SimilarityChecker()
results = checker.compare_folder("./documents/")
# Returns:
{
"files": ["doc1.txt", "doc2.txt", "doc3.txt"],
"comparisons": 3,
"similar_pairs": [
{"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
],
"matrix": <DataFrame>
}
Find Most Similar to Query
query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)
# Returns:
[
{"file": "doc3.txt", "similarity": 0.89},
{"file": "doc1.txt", "similarity": 0.72},
...
]
Output Format
Comparison Result
result = checker.compare_with_details(text1, text2)
# Returns:
{
"similarity": 0.82,
"method": "cosine",
"text1_length": 150,
"text2_length": 180,
"common_words": 25,
"unique_words_text1": 10,
"unique_words_text2": 15,
"interpretation": "High similarity - likely related content"
}
Example Workflows
Plagiarism Check
checker = SimilarityChecker()
submission = open("student_paper.txt").read()
results = checker.compare_folder("./source_materials/")
suspicious = [p for p in results["similar_pairs"] if p["similarity"] > 0.6]
if suspicious:
print(f"Warning: Found {len(suspicious)} potentially similar sources")
for p in suspicious:
print(f" {p['file1']} matches {p['file2']}: {p['similarity']:.0%}")
Document Deduplication
checker = SimilarityChecker()
# Load all documents
docs = {}
for file in Path("./articles/").glob("*.txt"):
docs[file.name] = file.read_text()
# Find near-duplicates
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)
print(f"Found {len(duplicates)} duplicate pairs")
Content Matching
checker = SimilarityChecker()
query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)
print("Most relevant articles:")
for r in results:
print(f" {r['file']}: {r['similarity']:.0%} match")
Dependencies
- scikit-learn>=1.3.0
- nltk>=3.8.0
- numpy>=1.24.0
- pandas>=2.0.0
Source
git clone https://github.com/dkyazzentwatwa/chatgpt-skills/blob/main/content-similarity-checker/SKILL.mdView on GitHub Overview
Content-Similarity-Checker compares documents and text using multiple algorithms, including TF-IDF-based cosine similarity and set-based Jaccard similarity, with optional Levenshtein distance for short texts. It supports batch comparisons, similarity matrices, and detailed reports to aid plagiarism detection, duplicate finding, and content matching.
How This Skill Works
The tool tokenizes text, builds TF-IDF vectors, and computes cosine similarity. It also performs Jaccard similarity on token sets and uses Levenshtein distance for short texts. Advanced features include batch comparisons, similarity matrices, and reporting, accessible via Python API or CLI.
When to Use It
- Detect plagiarism or unauthorized reuse across documents and submissions
- Find near-duplicate content within a large corpus or CMS to reduce redundancy
- Compare two versions of a document to quantify changes and similarity
- Content matching across diverse sources for copyright checks or licensing
- Batch analysis of many documents to generate a similarity matrix and reports
Quick Start
- Step 1: from similarity_checker import SimilarityChecker
- Step 2: checker = SimilarityChecker() # or checker = SimilarityChecker(method="jaccard")
- Step 3: score = checker.compare(text1, text2) # or checker.compare_files("doc1.txt", "doc2.txt")
Best Practices
- Normalize text before comparison (lowercase, remove punctuation, and stopwords if appropriate)
- Choose the right method by use case: cosine for longer docs, Levenshtein for short texts, and Jaccard for token-based sets
- Leverage batch comparisons and similarity matrices for large datasets to identify clusters
- Review threshold-based results with human validation to avoid false positives
- Use generated reports to document findings and support decisions
Example Use Cases
- Academic institutions screening student submissions for potential plagiarism
- A news site de-duplicating similar articles across multiple authors
- A version control workflow comparing draft revisions to quantify similarity
- A publishing house verifying content licensing by matching against external sources
- An e-commerce site detecting near-duplicate product descriptions to improve SEO