Question

I have a data frame with 27949 rows and 7 columns; the first few rows look like this: https://i.stack.imgur.com/1Pipf.png

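Task: The data frame has a "title" column containing many duplicate titles that I want to remove (duplicate titles: titles that are identical except for 1 or 2 words).

Pseudocode: I want to compare the first row against all other rows and delete any that are duplicates. Then I want to compare the second row against all other rows and delete any duplicates, and so on for every row, i.e. i = first row to last row, j = i + 1 to last row.

My code: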

for i in range(0,27950):
    for j in range(1,27950):
        a = data_sorted['title'].iloc[i].split()
        b = data_sorted['title'].iloc[j].split()
        if len(a)-len(b)<=2:
            data_sorted.drop(b)
            j=j
        else:
            j+=1
    i+=1

Error: IndexError: single positional indexer is out-of-bounds

Can anyone help me fix my code? Thanks in advance.

Answer 1

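I suggest the following approach:

Build a difference matrix of the titles, where element (i, j) is the number of words that differ between the i-th title and the j-th title.

Like this: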

    import numpy as np
    from itertools import product

    l = list(data_sorted['title'])

    def diff_words(text_1, text_2):
        # return the number of different words between two texts
        words_1 = text_1.split()
        words_2 = text_2.split()
        diff = max(len(words_1),len(words_2))-len(np.intersect1d(words_1, words_2))
        return diff


    differences = [diff_words(i, j) for i, j in product(l, l)]
    # differences: a flat, row-major list of integers; element i * len(l) + j is the word difference between titles i and j
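To actually drop the near-duplicate rows, one option is a greedy pass that keeps a title only if it differs from every title kept so far by more than a threshold. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name `keep_positions`, the `max_diff=2` threshold (taken from the "1 or 2 words" in the question) and the sample titles are all hypothetical, and `diff_words` here uses Python sets rather than `np.intersect1d`, so repeated words within a title are counted once. Because it iterates over the titles themselves rather than a hard-coded `range(0, 27950)`, it never asks for position 27949, which does not exist in a 27949-row frame and is what raises the `IndexError` in the question.

```python
def diff_words(text_1, text_2):
    """Number of words by which two titles differ (set-based variant)."""
    words_1 = set(text_1.split())
    words_2 = set(text_2.split())
    return max(len(words_1), len(words_2)) - len(words_1 & words_2)


def keep_positions(titles, max_diff=2):
    """Greedy pass: keep a title only if it differs from every
    already-kept title by more than max_diff words."""
    kept = []
    for pos, title in enumerate(titles):
        if all(diff_words(title, titles[k]) > max_diff for k in kept):
            kept.append(pos)
    return kept


# Hypothetical sample titles, for illustration only.
titles = [
    "red cotton shirt for men",
    "red cotton shirt for women",
    "blue denim jeans slim fit",
    "red cotton shirt for boys",
    "stainless steel water bottle",
]
kept = keep_positions(titles)
print(kept)  # [0, 2, 4]

# With the DataFrame from the question, the same positions could be used as:
# data_sorted = data_sorted.iloc[keep_positions(list(data_sorted['title']))]
```

This only compares each title against the ones already kept, which is usually far fewer comparisons than the full matrix, though the worst case is still quadratic in the number of rows.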
