Question

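I am looking at working on an NLP project, in any programming language (though Python would be my preference).

I want to take two documents and determine how similar they are.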

Answer 1

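The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.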
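For reference, here is a sketch of the two quantities involved, using one common textbook weighting (libraries typically apply slightly different, smoothed variants). The symbols are assumptions introduced for illustration: tf_{t,d} is the number of times term t occurs in document d, N is the number of documents, and df_t is the number of documents containing t.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t},
\qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

When every document vector has already been scaled to unit length, the denominator is 1 and the cosine similarity reduces to a plain dot product.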

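TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as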

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to the documents
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

Or, if the documents are plain strings,

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).toarray()
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
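To read such a matrix, a typical next step is to find, for each document, the most similar other document while ignoring the diagonal (every document has similarity 1 with itself). The sketch below copies the matrix printed above into plain lists; `most_similar` is a hypothetical helper name, not part of scikit-learn.

```python
# Pairwise similarity matrix copied from the example output above.
docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
sim = [[1.0, 0.25082859, 0.39482963, 0.0],
       [0.25082859, 1.0, 0.22057609, 0.0],
       [0.39482963, 0.22057609, 1.0, 0.26264139],
       [0.0, 0.0, 0.26264139, 1.0]]

def most_similar(sim, i):
    """Return the index of the document most similar to document i, skipping i itself."""
    candidates = [j for j in range(len(sim)) if j != i]
    return max(candidates, key=lambda j: sim[i][j])

for i, doc in enumerate(docs):
    j = most_similar(sim, i)
    print(f"{doc!r} -> {docs[j]!r} ({sim[i][j]:.3f})")
```

For a large corpus the same idea would be applied to the sparse matrix directly rather than to nested lists.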

Gensim may have more options for this kind of task, though.

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

Answer 2

Same as @larsman, but with some preprocessing

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt') # if necessary...


stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    # entry [0, 1] of the pairwise matrix is the similarity between text1 and text2
    return (tfidf * tfidf.T).toarray()[0, 1]


print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))

Answer 3

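This is an old question, but I found that this can be done easily with spaCy. Once the documents are read, the simple similarity API can be used to find the cosine similarity between the document vectors.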

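import spacy
# spaCy 3 removed the 'en' shortcut; load a named model instead,
# e.g. after running: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')
doc1 = nlp('Hello hi there!')
doc2 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')

# the values in the comments are from the original run; they vary with the model used
print(doc1.similarity(doc2)) # 0.999999954642
print(doc2.similarity(doc3)) # 0.699032527716
print(doc1.similarity(doc3)) # 0.699032527716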

Answer 4

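Generally, the cosine similarity between two documents is used as the similarity measure. In Java, you can use Lucene (if your collection is very large) or LingPipe to do this. The basic concept is to count the terms in each document and compute the dot product of the term vectors. These libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and computing TF-IDF vectors. If you want to do something more complex, LingPipe also provides methods to compute LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.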
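The basic concept (count the terms, then take the normalized dot product of the term vectors) can be sketched with the Python standard library alone. This is a minimal illustration without the IDF weighting or LSA mentioned above; the regex tokenizer and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Count lowercase word occurrences; the regex tokenizer is a simplifying assumption."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors."""
    a, b = term_vector(text_a), term_vector(text_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity("a little bird", "a little bird chirps"))
print(cosine_similarity("a little bird", "a big dog barks"))
```

Raw counts let frequent words dominate the score, which is the weakness that inverse document frequency weighting is meant to address.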

Answer 5

Here's a little app to get you started...

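import difflib as dl

# read both files as text
with open('file') as fa:
    a = fa.read()
with open('file1') as fb:
    b = fb.read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# count the words of a that have at least one close match among the words of b
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))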
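The score above is asymmetric: it only measures how many words of the first file have a close match in the second. As a related alternative (not the method used above), difflib.SequenceMatcher gives a symmetric ratio over the two word sequences. A minimal sketch, where `word_similarity` is a hypothetical helper name:

```python
import difflib

def word_similarity(text_a, text_b):
    """Return a ratio in [0, 1] based on matching runs of words in the two texts."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # ratio() is 2 * M / T, where M counts matched words and T is the total word count
    return difflib.SequenceMatcher(None, words_a, words_b).ratio()

print(word_similarity("a little bird", "a little bird chirps"))
print(word_similarity("a little bird", "a big dog barks"))
```

Because it compares word order as well as word overlap, this measure is sensitive to reordering, which may or may not be what you want.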

Answer 6

To find sentence similarity with very little data and still get high accuracy, you can use the Python package below, which uses pre-trained BERT models:

pip install similar-sentences

Answer 7

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
input_dict = {
    'doc1': 'Document with some text',
    'doc2': 'Other document with some text',
}
# POST data must be bytes in Python 3
params = urllib.parse.urlencode(input_dict).encode('utf-8')
with urllib.request.urlopen(API_URL, params) as f:
    response = f.read()
response_object = json.loads(response)
print(response_object)

Answer 8

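If you are looking for something very accurate, you need a better tool than TF-IDF. The Universal Sentence Encoder is one of the most accurate ways to find the similarity between any two pieces of text. Google provides pre-trained models that you can use in your own application without training anything from scratch. First, you have to install tensorflow and tensorflow-hub: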

    pip install tensorflow
    pip install tensorflow_hub

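The code below lets you convert any text to a fixed-length vector representation; you can then use the dot product to find the similarity between the vectors

# Note: this example uses the TensorFlow 1.x API (hub.Module, tf.placeholder, tf.Session)
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)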

# sample text
messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",

# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",

# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]

similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})

    corr = np.inner(message_embeddings_, message_embeddings_)
    print(corr)
    heatmap(messages, messages, corr)

And the plotting code:

import matplotlib.pyplot as plt
import numpy as np

def heatmap(x_labels, y_labels, values):
    fig, ax = plt.subplots()
    im = ax.imshow(values)

    # We want to show all ticks...
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
         rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(y_labels)):
        for j in range(len(x_labels)):
            text = ax.text(j, i, "%.2f" % values[i, j],
                           ha="center", va="center", color="w",
                           fontsize=6)

    fig.tight_layout()
    plt.show()

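The result would be:

As you can see, the highest similarity is between each text and itself, followed by texts that are close in meaning.

IMPORTANT: the first time you run the code it will be slow, because it needs to download the model. If you want to prevent it from downloading the model again and use the local copy instead, you have to create a folder for the cache, add it to the environment variable, and then use that path after the first run: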

import os

tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# pointing to the folder inside cache dir, it will be unique on your system
module_url = os.path.join(tf_hub_cache_dir, "d8fbeb5c580e50f975ef73e80bebba9654228449")
embed = hub.Module(module_url)

More information: https://tfhub.dev/google/universal-sentence-encoder/2

Answer 9

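If you are more interested in measuring the semantic similarity of two pieces of text, I suggest you take a look at this gitlab project. You can run it as a server, and there is also a pre-built model that you can easily use to measure the similarity of two pieces of text; even though it is mostly trained for measuring the similarity of two sentences, you can still use it in your case. It is written in Java, but you can run it as a RESTful service.

Another option is DKPro Similarity, a library with various algorithms for measuring text similarity. However, it is also written in Java.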

Answer 10

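I am combining the solutions from the answers of @FredFoo and @Renaud. My solution applies @Renaud's preprocessing to @FredFoo's text corpus and then displays the pairwise similarities wherever the similarity is greater than 0. I ran this code on Windows by first installing python and pip. pip is installed as part of python, but you may have to do so explicitly by re-running the installation package, choosing modify, and then choosing pip. I use the command line to execute my python code, saved in a file "similarity.py". I had to execute the following commands: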

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install scikit-learn
>python -m pip install nltk
>py similarity.py

The code for similarity.py is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple", 
           "An apple a day keeps the doctor away", 
           "Never compare an apple to an orange", 
           "I prefer scikit-learn to Orange", 
           "The scikit-learn docs are Orange and Blue"]  

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)

pairwise_similarity = tfidf * tfidf.T

#view the pairwise similarities 
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))

Answer 11

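For syntactic similarity, there are 3 easy ways of detecting similarity.

Word2Vec
GloVe
TF-IDF or CountVectorizer

For semantic similarity, one can use BERT embeddings and try different word pooling strategies to get a document embedding, and then apply cosine similarity to the document embeddings.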
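To make the pooling step concrete, here is a minimal sketch of two common strategies, mean pooling and max pooling, followed by cosine similarity. The token vectors are made-up toy numbers standing in for real model output (actual BERT embeddings have hundreds of dimensions), and the function names are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors component-wise into one document vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the component-wise maximum over the token vectors."""
    return [max(column) for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Toy 3-dimensional "token embeddings"; placeholder values, not real model output.
doc_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
doc_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

print(cosine(mean_pool(doc_a), mean_pool(doc_b)))
print(cosine(max_pool(doc_a), max_pool(doc_b)))
```

The two strategies can rank document pairs differently, which is why it is worth trying more than one.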

A more advanced approach is to use BERTScore to get the similarity.

Research paper link: https://arxiv.org/abs/1904.09675

How to compute the similarity between two text documents?

11 Answers: