Question

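I have a dataframe and I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates so that only the relevant information is retained.

For example, if I had a data frame such as: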

animals = pd.read_csv("animal.csv")

  animal1         animal2        label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81
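
For anyone without animal.csv at hand, the table above can be rebuilt inline. This is only a sketch: the CSV text below is an assumed stand-in for the file, whose actual contents and delimiter are not shown in the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for animal.csv, copied from the table above.
csv_text = """animal1,animal2,label
cat dog,dolphin,19
dog cat,cat,72
pilchard 26,koala,26
newt bat 81,bat,81
"""

animals = pd.read_csv(io.StringIO(csv_text))
# Use 1-based row labels so the frame matches the table in the question.
animals.index = range(1, len(animals) + 1)
print(animals)
```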

I would like to combine the columns but only retain the unique information from each of the strings.

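You can see that in row 2 'cat' is contained in both columns 'animal1' and 'animal2'. In row 3, the number 26 is in both 'animal1' and 'label'. In row 4, the information in columns 'animal2' and 'label' is already contained in 'animal1'.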

I combine the columns by doing the following:

animals["detail"] = animals["animal1"].map(str) + ' ' + animals["animal2"].map(str) + ' ' + animals["label"].map(str)

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat cat 72
3 pilchard 26     koala          26           pilchard 26 koala 26
4 newt bat 81     bat            81           newt bat 81 bat 81

Row 1 is fine, but the other rows, of course, contain duplicates as described above.

The output I would like is:

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat 72
3 pilchard 26     koala          26           pilchard koala 26
4 newt bat 81     bat            81           newt bat 81

Alternatively, keeping only the first unique instance of each word/number per row in the detail column would also be suitable, i.e.:

  detail 
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

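I have looked at doing this for a string in Python, e.g. "How can I remove duplicate words in a string with Python?", "How to get all the unique words in the data frame?" and "show distinct column values in pyspark dataframe: python", but I can't figure out how to apply this to individual rows within the detail column. I have looked at splitting the text after combining the columns and then using apply and lambda, but I haven't got this to work yet. Or is there perhaps a way to do it while combining the columns?

I have a solution in R but would like to recode it in Python.

Any help or advice would be much appreciated. I'm currently using Spyder (Python 3.5).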

Answer 1

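You can add a custom function that first splits on whitespace, then gets the unique values with pandas.unique, and finally joins them back into a string: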

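# Parentheses let the expression span several lines
animals["detail"] = (animals["animal1"].map(str) + ' ' +
                     animals["animal2"].map(str) + ' ' +
                     animals["label"].map(str))

# Split each row on whitespace, keep the first occurrence of each token, rejoin
animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print(animals)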
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81
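
For reference, here is a self-contained sketch of the same idea that can be pasted and run as-is. The inline DataFrame is an assumed reconstruction of the question's table, and the helper name unique_words is made up for illustration. The tokens are wrapped in a Series before calling pd.unique, since newer pandas versions may warn when it is given a plain list.

```python
import pandas as pd

# Assumed reconstruction of the question's data.
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)


def unique_words(text):
    """Keep the first occurrence of each whitespace-separated token."""
    tokens = pd.Series(text.split())
    return " ".join(pd.unique(tokens))


# Join the three columns with spaces, then deduplicate row by row.
combined = (animals["animal1"].astype(str) + " "
            + animals["animal2"].astype(str) + " "
            + animals["label"].astype(str))
animals["detail"] = combined.apply(unique_words)
print(animals["detail"].tolist())
```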

It is also possible to join the values inside apply:

# Cast every column to str, join each row, then deduplicate its tokens
animals["detail"] = (animals.astype(str)
                            .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())), axis=1))
print(animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81
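
Note that the outputs above keep the first occurrence of each token, so row 3 becomes 'pilchard 26 koala', whereas the first desired table in the question shows 'pilchard koala 26'. One way to get that ordering, offered only as a sketch, is to keep the last occurrence instead. The helper name unique_words_keep_last and the inline DataFrame are assumptions for illustration, and the code relies on dict preserving insertion order (Python 3.7+).

```python
import pandas as pd

# Assumed reconstruction of the question's data.
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)


def unique_words_keep_last(text):
    """Keep the last occurrence of each whitespace-separated token."""
    tokens = text.split()
    # Deduplicate the reversed tokens, so the first one seen is the last
    # one in the original string, then restore the original direction.
    kept = list(dict.fromkeys(reversed(tokens)))
    return " ".join(reversed(kept))


animals["detail"] = animals.astype(str).apply(
    lambda row: unique_words_keep_last(" ".join(row)), axis=1
)
print(animals["detail"].tolist())
```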

A solution with set, but it changes the order:

# set() removes duplicates but does not preserve the original word order
animals["detail"] = (animals.astype(str)
                            .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1))
print(animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dolphin 19 dog
2      dog cat      cat     72          cat dog 72
3  pilchard 26    koala     26   26 pilchard koala
4  newt bat 81      bat     81         bat 81 newt
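
Because a set has no defined order, the exact output above can differ between runs. If the original word order does not matter but reproducible output does, sorting the set is one option. The sketch below assumes plain alphabetical sorting is acceptable (digits sort before letters) and uses an assumed reconstruction of the question's data.

```python
import pandas as pd

# Assumed reconstruction of the question's data.
animals = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    },
    index=[1, 2, 3, 4],
)

# Deduplicate with a set, then sort so the result is the same on every run.
animals["detail"] = animals.astype(str).apply(
    lambda row: " ".join(sorted(set(" ".join(row).split()))), axis=1
)
print(animals["detail"].tolist())
```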

Answer 2

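If you want to keep the order in which the words appear, you can first split the words in each column, merge them, remove the duplicates, and finally join them together into a new column.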

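# Transpose so each original row becomes a column, split every cell into words,
# flatten the lists with sum(x, []), drop duplicates and join back into a string
df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                  .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))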

df
Out[46]: 
         animal1   animal2   label                 detail
0      1 cat dog   dolphin       19  1 cat dog dolphin 19
1      2 dog cat       cat       72          2 dog cat 72
2  3 pilchard 26     koala       26   3 pilchard 26 koala
3  4 newt bat 81       bat       81         4 newt bat 81
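
The same split, merge, drop_duplicates and join steps can also be written row-wise without the transpose. The sketch below is a variant, not the code above: it flattens the word lists with itertools.chain.from_iterable rather than sum(x, []), which copies the growing list on each addition. The helper name merge_row and the inline DataFrame (without the row numbers that appear inside animal1 in the output above) are assumptions.

```python
from itertools import chain

import pandas as pd

# Assumed reconstruction of the question's data.
df = pd.DataFrame(
    {
        "animal1": ["cat dog", "dog cat", "pilchard 26", "newt bat 81"],
        "animal2": ["dolphin", "cat", "koala", "bat"],
        "label": [19, 72, 26, 81],
    }
)


def merge_row(row):
    """Split every cell into words, flatten, drop duplicates, rejoin."""
    tokens = chain.from_iterable(str(cell).split() for cell in row)
    # drop_duplicates keeps the first occurrence, so word order is preserved.
    return " ".join(pd.Series(list(tokens)).drop_duplicates())


df["detail"] = df.apply(merge_row, axis=1)
print(df["detail"].tolist())
```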

Answer 3

I'd suggest removing the duplicates at the end of the process by using a Python set.

Here is an example function:

def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)

It works like this:

val = 'dog cat cat 81'
print(dedup(val))

81 dog cat

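If you want the details to stay in order, you can use OrderedDict from collections, or pd.unique, instead of set.

Then just apply it (similar to map) on your detail column to get the desired result: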

animals.detail = animals.detail.apply(dedup)
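
As a sketch of the OrderedDict option mentioned above (the function name dedup_ordered is made up for illustration), the keys of an OrderedDict are unique and remember the order in which they were first inserted:

```python
from collections import OrderedDict


def dedup_ordered(value):
    """Remove repeated words while keeping the first occurrence of each."""
    words = value.split()
    # OrderedDict.fromkeys drops repeats and preserves first-seen order.
    return " ".join(OrderedDict.fromkeys(words))


val = "dog cat cat 81"
print(dedup_ordered(val))  # dog cat 81
```

It can be applied to the column in the same way as dedup, for example animals.detail.apply(dedup_ordered).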

pandas: combining columns without duplicates
