Question

I want to compare the texts in a Python list with one another. For example:

Url          | text
             |
www.xyz.com  | "hello bha njik **bhavd bhavd** bjavd manhbd kdkndsik wkjdk"
             |
www.abc.com  | "bhavye jsbsdv sjbs jcsbjd adjbsd jdfhjdb jdshbjf jdsbjf"
             |
www.lokj.com | "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj **bhavd bhavd**"

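Now I want to compare the first text with the other rows to find out how many words they have in common, then move on to the second row and compare it with the rows that follow, and so on.

What approach should I take, and what data structure should I use?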

Answer 1

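For Python 3:

As detailed in the comments, we generate every possible pair, create sets to ensure each word is counted once, and simply count the number of unique common words for each pair. If your list of texts is structured a bit differently, you may need to adjust this slightly.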
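
A minimal sketch of the approach described above, assuming the texts are held as (url, text) tuples. The names `rows` and `count_common_words` are illustrative and not taken from the original answer.

```python
from itertools import combinations

# Hypothetical input: (url, text) pairs taken from the question's table.
rows = [
    ("www.xyz.com", "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk"),
    ("www.abc.com", "bhavye jsbsdv sjbs jcsbjd adjbsd jdfhjdb jdshbjf jdsbjf"),
    ("www.lokj.com", "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj bhavd bhavd"),
]


def count_common_words(rows):
    """Return {(url_a, url_b): number of unique words shared by both texts}."""
    # Build each word set once so that repeated words count only once.
    word_sets = [(url, set(text.split())) for url, text in rows]
    return {
        (url_a, url_b): len(words_a & words_b)
        # combinations() yields every unordered pair exactly once.
        for (url_a, words_a), (url_b, words_b) in combinations(word_sets, 2)
    }


common = count_common_words(rows)
for (url_a, url_b), n in common.items():
    print(url_a, url_b, n)
```

With the question's data, only the first and third rows share a word ("bhavd"), so that pair gets a count of 1 and the other two pairs get 0.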


Side note: depending on your purpose, you may want to look at algorithms such as TF-IDF (A simple tutorial) for text/document similarity, or many others...
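
To give a rough idea of what TF-IDF does, here is a small pure-Python sketch. It assumes one common weighting (term frequency times log(N / document frequency)) and cosine similarity; libraries often use smoothed variants, and the function names `tfidf_vectors` and `cosine` are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {word: weight} dict per text, using tf * log(N / df)."""
    docs = [Counter(text.split()) for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts each word appears.
    df = Counter(word for doc in docs for word in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in doc.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


texts = [
    "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk",
    "bhavye jsbsdv sjbs jcsbjd adjbsd jdfhjdb jdshbjf jdsbjf",
    "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj bhavd bhavd",
]
vectors = tfidf_vectors(texts)
print(cosine(vectors[0], vectors[1]))  # no shared words, so 0.0
print(cosine(vectors[0], vectors[2]))  # positive, because both contain "bhavd"
```

Unlike a plain count of shared words, this weighting gives less credit to words that appear in many of the texts.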

Answer 2

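The best way is to use OrderedDict(), which is useful for preserving the order in which the dict keys are retrieved.

By iterating over that dictionary and comparing the values, you will get the output.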
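
One way this could look, as a sketch only: it assumes the URLs are the keys and the texts are the values, and the helper name `compare_with_first` is hypothetical.

```python
from collections import OrderedDict

# Hypothetical data: URL -> text, inserted in the order of the question's table.
texts = OrderedDict([
    ("www.xyz.com", "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk"),
    ("www.abc.com", "bhavye jsbsdv sjbs jcsbjd adjbsd jdfhjdb jdshbjf jdsbjf"),
    ("www.lokj.com", "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj bhavd bhavd"),
])


def compare_with_first(texts):
    """Count the unique words the first entry shares with each later entry."""
    items = iter(texts.items())
    _first_url, first_text = next(items)
    first_words = set(first_text.split())
    result = OrderedDict()
    # The remaining items come out in insertion order.
    for url, text in items:
        result[url] = len(first_words & set(text.split()))
    return result


shared = compare_with_first(texts)
for url, n in shared.items():
    print(url, n)
```

Since Python 3.7 a plain dict also preserves insertion order, so OrderedDict mainly matters on older versions or when its extra methods are needed.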

Answer 3

One possible approach is to convert each string into a set of words and then take the intersection of the sets.

string_1 = "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk"
string_2 = "bhavd dskghfski fjfbhskf ewkjhsdkifs fjuekdjsdf ue"

# First split your strings into sets of words
set_1 = set(string_1.split())
set_2 = set(string_2.split())

# Compare the sets to find where they both have the same value
print(set_1 & set_2)
print(set_1.intersection(set_2))

# Both print out {'bhavd'}
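
Note that sets drop repeated words, so "bhavd bhavd" counts once. If repeats should count, a possible variation (an assumption about what is wanted, not part of the answer above) is to intersect `collections.Counter` objects, which keeps the smaller of the two counts for each shared word. The helper name `shared_word_counts` is illustrative.

```python
from collections import Counter


def shared_word_counts(text_a, text_b):
    """Multiset intersection: each shared word with the smaller of its two counts."""
    return Counter(text_a.split()) & Counter(text_b.split())


# First and third rows from the question's table.
row_1 = "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk"
row_3 = "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj bhavd bhavd"

overlap = shared_word_counts(row_1, row_3)
print(overlap)                # Counter({'bhavd': 2})
print(sum(overlap.values()))  # 2
```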

Compare text in a table in Python

3 answers: