Semantic similarity of sentences in a text

Time: 2017-01-11 15:57:20

Tags: python vector tf-idf sentence-similarity latent-semantic-analysis

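Using material from here and from an earlier forum page, I wrote some code for a program that automatically computes the semantic similarity between consecutive sentences throughout a text. Here it is: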

The first part of the code is copied and pasted from the first link. I deleted everything superfluous after line 245 and added the following in its place.

with open ("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        similarity(x, y, true)           
#boolean set to false or true 
        x = y
        y = sentence_file.readline() 

My text file is formatted as follows:

  

Red alcoholic drink. Fresh orange juice. An English dictionary. The Yellow Wallpaper.

In the end, I want to display every pair of consecutive sentences with its similarity next to it, like this:

["Red alcoholic drink.", "Fresh orange juice.", 0.611],

["Fresh orange juice.", "An English dictionary.", 0.0]

["An English dictionary.", "The Yellow Wallpaper.",  0.5]

if norm(vec_1) > 0 and if norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1)* np.linalg.norm(vec_2))
 elif norm(vec_1) < 0 and if norm(vec_2) < 0:
    ???Move On???

1 Answer:

Answer 0 (score: 0)

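This should work. There are some notes in the comments. Basically, you loop over the lines in the file and store the results. One way to process two lines at a time is to set up an "infinite loop" and check the last line we read to see whether we have reached the end of the file (readline() returns an empty string, not None, at the end of the file).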

# You'll probably need the file extension (.txt or whatever) in open as well
with open ("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []

    # Loop until we hit the end of the file
    while True:
        # Read two lines
        x = sentence_file.readline()
        y = sentence_file.readline()

        # Check if we've reached the end of the file, if so, we're done
        if not y:
            # Break out of the infinite loop
            break
        else:
            # The .rstrip('\n') removes the newline character from each line
            x = x.rstrip('\n')
            y = y.rstrip('\n')

            try: 
                # Calculate your similarity value
                similarity_value = similarity(x, y, True)

                # Add the two lines and similarity value to the results list
                results.append([x, y, similarity_value])
            except Exception:
                print("Error when parsing lines:\n{}\n{}\n".format(x, y))

# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)
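
Note that the loop above reads two fresh lines on every pass, so it compares lines 1 and 2, then 3 and 4, and so on. If overlapping pairs are wanted (sentence 1 with 2, then 2 with 3), as in the expected output shown in the question, one option is to pair each line with its successor. The following is only a sketch, assuming the file holds one sentence per line; `consecutive_pairs` and `score_pairs` are hypothetical helper names, and the `similarity` argument stands in for whatever scoring function you already have.

```python
def consecutive_pairs(lines):
    """Yield (previous, current) for each adjacent pair of non-empty lines."""
    previous = None
    for line in lines:
        current = line.rstrip("\n")
        if not current:
            continue  # skip blank lines between sentences
        if previous is not None:
            yield previous, current
        previous = current


def score_pairs(lines, similarity):
    """Return [sentence_a, sentence_b, score] for every adjacent pair."""
    results = []
    for x, y in consecutive_pairs(lines):
        results.append([x, y, similarity(x, y)])
    return results
```

With the function from the question this might be called as `score_pairs(sentence_file, lambda a, b: similarity(a, b, True))` inside the `with open(...)` block.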

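Edit: Regarding the errors you are getting from similarity(): if you simply want to ignore the line pairs that cause them (without digging into the source, I really don't know what's going on), you can wrap the call to similarity() in a try/except.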
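
An alternative to catching the exception is to guard the division itself, which seems to be what the last snippet in the question is attempting. The sketch below assumes the error comes from a sentence vector of zero length, which the source does not confirm. `safe_cosine` is a hypothetical name, and it uses plain Python lists rather than NumPy arrays to stay dependency-free; the same check would apply around np.linalg.norm.

```python
import math


def safe_cosine(vec_1, vec_2):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))
    # A norm can never be negative, so the only case to guard is zero
    if norm_1 == 0.0 or norm_2 == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    return dot / (norm_1 * norm_2)
```

Returning 0.0 for an empty vector is a design choice: the pair still shows up in the results instead of being dropped, at the cost of being indistinguishable from a pair that is genuinely unrelated.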