我有一个包含数百个txt文件的文件夹,我需要分析它们的相似性。下面是我用来运行相似性分析的脚本示例。最后我得到了一个我可以绘制的数组或矩阵等。
我想看看cos_similarity > 0.5
(或我决定使用的任何其他阈值)有多少对,当我比较相同的文件时删除cos_similarity == 1
。
其次,我需要基于文件名的这些对的列表。
因此下面示例的输出如下:
1
和
["doc1", "doc4"]
非常感谢你的帮助,因为我觉得有点迷失,不知道要去哪个方向。
这是我获取矩阵的脚本示例:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
# documents = []
# for filename in glob.iglob(path+'*.txt'):
# _file = open(filename, 'r')
# text = _file.read()
# documents.append(text)
# return documents
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
tfidf = TfidfVec.fit_transform(textlist)
return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
输出:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
答案 0 :(得分:1)
正如我理解你的问题,你想创建一个函数来读取输出numpy数组和一个特定的值(阈值),以便返回两件事:
所以,我在这里做了以下函数,该函数有三个参数:
cos_similarity()
函数的输出numpy数组。在这里:
def get_docs(arr, docs_names, threshold):
output_tuples = []
for row in range(len(arr)):
lst = [row+1+idx for idx, num in \
enumerate(arr[row, row+1:]) if num >= threshold]
for item in lst:
output_tuples.append( (docs_names[row], docs_names[item]) )
return len(output_tuples), output_tuples
让我们看看它的实际效果:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(lst, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
让我们看看这个功能是如何运作的: