Computing a distance with a bag-of-words approach

Time: 2016-08-24 14:34:49

Tags: python count distance

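My code runs, but my function always returns 0.0. The code reads .txt files and builds a matrix in which each .txt file is one row, and each word of that .txt file has its own column in the corresponding row.

I compare the rows pairwise. For every word in the union of two rows, I want to count how often it occurs in each row. But although the code runs, I get a wrong result (0.0).

I thought I might have made a mistake in the function that builds the matrix, but the matrix looks fine.

The odd thing is that if I create the lists by hand: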

a = ["a", "b", "c", "d"],
b = ["b", "c", "d", "e"]

it works, but when I change them to:

a = ["word 1", "word 2", "word 3", "word 4"],
b = ["word 2","word 3","word 4","word 5",] 

the result is 0.0 again. I'm confused!

My code:

def bow_distance(a, b):

    p = 0

    if len(a) > len(b):
        max_words = len(a)
    else:
        max_words = len(b)

    list_words_ab = list(set(a) | set(b))

    len_bow_matrix = len(list_words_ab)
    bow_matrix = numpy.zeros(shape = (3, len_bow_matrix), dtype = str)

    while p < len_bow_matrix:
        bow_matrix[0, p] = str(list_words_ab[p])
        p = p+1

    p = 0   

    while p < len_bow_matrix:
        bow_matrix[1, p] = a.count(bow_matrix[0, p])
        bow_matrix[2, p] = b.count(bow_matrix[0, p])
        p = p+1

    p = 0
    overlap = 0

    while p < len_bow_matrix:
        abs_difference = abs(float(bow_matrix[1, p]) - float(bow_matrix[2, p]))
        overlap = overlap + abs_difference
        p = p+1

    return (overlap/2)/max_num_parts


# Calculate the distances

i = 1
j = 1

while i < num_of_txt + 1:

    print(i)
    newfile = open("TXT_distance_" + str(i)+".txt", "w")

    while j < num_of_txt + 1:
        newfile.write(str(bow_distance(text_word_matrix[i-1], text_word_matrix[j-1])) + " ")
        j = j+1

    newfile.close()
    j = 1
    i = i+1

1 answer:

Answer 0: (score: 0)

At first glance I see two mistakes here:

a = ["a", "b", "c", "d"], <----- comma here 
b = ["b", "c", "d", "e"]
it works, but when I change to:

a = ["word 1", "word 2", "word 3", "word 4"], <----- and here 
b = ["word 2","word 3","word 4","word 5",] <----- and here inside the list
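
To make the point about the commas concrete: a comma after the closing bracket turns the assignment into a one-element tuple that contains the list, while the comma inside the last list is harmless. Separately, a likely cause of the constant 0.0 is the line numpy.zeros(shape = (3, len_bow_matrix), dtype = str): with dtype = str NumPy allocates one-character strings, so "word 1" is stored as "w", and a.count("w") is 0 for every column. That would also explain why single-letter words work. The following is a minimal sketch of the same distance without the string matrix, using collections.Counter. It assumes that the undefined name max_num_parts in the question was meant to be max_words; that reading is a guess, not something the question confirms.

```python
from collections import Counter

# A trailing comma after a list literal wraps the list in a one-element tuple.
a_with_comma = ["a", "b", "c", "d"],
a_plain = ["a", "b", "c", "d"]


def bow_distance(a, b):
    """Half the summed absolute count difference over the union of words,
    divided by the length of the longer list (assumed normalizer)."""
    counts_a, counts_b = Counter(a), Counter(b)
    max_words = max(len(a), len(b))
    if max_words == 0:
        return 0.0  # two empty lists: treat them as identical
    # Counter returns 0 for words it has not seen, so no KeyError here.
    difference = sum(abs(counts_a[w] - counts_b[w]) for w in set(a) | set(b))
    return (difference / 2) / max_words


a = ["word 1", "word 2", "word 3", "word 4"]
b = ["word 2", "word 3", "word 4", "word 5"]
print(type(a_with_comma).__name__, type(a_plain).__name__)  # tuple list
print(bow_distance(a, b))  # 0.25
```

Keeping the counts as integers in a Counter avoids storing words and numbers in the same string array, which is where the truncation comes from in the first place.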