Question

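I am importing a column from a DataFrame that was read from a CSV file. The column holds 1280 IDs that I believed were unique.

I plan to put each ID into a dictionary as a key, with the value set to '0', and then put everything into a new DataFrame.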

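When I build the dictionary (or a set) from the column, the count drops to 1189 instead of 1280, while the plain list extracted from the column still has 1280 entries.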

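I assume the original DataFrame must contain duplicates. That is surprising, because these IDs are supposed to be unique. I could take a shortcut and simply use the deduplicated list for the new DataFrame. However, it is crucial that I figure out what is going on and determine whether there really are duplicates.

The only problem is that I cannot identify any duplicates, and I have no idea what is going wrong.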

import pandas as pd
from itertools import cycle

DF0 = pd.read_csv("FILENAME.csv", sep='$', encoding='utf-8-sig')

l_o_0 = ['0']

l_DF0 = list(DF0['Short_ID'])
print('  len of origin object   '+str(len(DF0['Short_ID'])))
print('            l_DF0 is a   '+str(type(l_DF0)))
print('                of len   '+str(len(l_DF0))+'\n')

d_DF0 = dict(zip(DF0['Short_ID'], cycle(l_o_0)))
print('  len of origin object   '+str(len(DF0['Short_ID'])))
print('            d_DF0 is a   '+str(type(d_DF0)))
print('                of len   '+str(len(d_DF0))+'\n')

print('           difference:   '+(str(len(DF0['Short_ID'])-len(d_DF0)))+'\n')

s_DF0 = set(l_DF0)
print('            s_DF0 is a   '+str(type(s_DF0)))
print('             of length   '+str(len(s_DF0))+'\n')

red_l_DF0 = list(s_DF0)
print('        red_l_DF0 is a   '+str(type(red_l_DF0)))
print('             of length   '+str(len(red_l_DF0))+'\n')

l_prob = []
for item in l_DF0:
    if item not in red_l_DF0:
        l_prob.append(item)
print('           l_prob is a   '+str(type(l_prob)))
print('             of length   '+str(len(l_prob))+'\n')

The output is:

  len of origin object   1280
            l_DF0 is a   <class 'list'>
                of len   1280

  len of origin object   1280
            d_DF0 is a   <class 'dict'>
                of len   1189

           difference:   91

            s_DF0 is a   <class 'set'>
             of length   1189

        red_l_DF0 is a   <class 'list'>
             of length   1189

           l_prob is a   <class 'list'>
             of length   0
>>>

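I tried the approach above based on what I found here:
Python list subtraction operation
Either I am not using the tool correctly, or it is the wrong tool. Any help would be greatly appreciated. Thanks in advance!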

Answer 1

Use the pandas duplicated function:

duplicated_stuff = DF0[DF0['Short_ID'].duplicated()]
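
To see what that expression returns, here is a minimal sketch on made-up data. The toy DataFrame and the names df, mask and repeats are assumptions for illustration only; the real data would come from the CSV in the question.

```python
import pandas as pd

# Toy data standing in for the CSV; the values are invented for illustration.
df = pd.DataFrame({"Short_ID": ["a1", "b2", "a1", "c3", "b2", "a1"]})

# duplicated() returns a boolean Series. With the default keep='first',
# the first occurrence of each value is False and every later repeat is True.
mask = df["Short_ID"].duplicated()

# Boolean indexing keeps only the rows flagged as repeats.
repeats = df[mask]
print(repeats)

# Number of surplus rows: total rows minus distinct IDs.
surplus = len(df) - df["Short_ID"].nunique()
print(surplus)
```

On the real data, the surplus computed this way should correspond to the gap between the list length and the dict length reported in the question.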

Depending on what you want to see, change the keep parameter of duplicated. For debugging, you probably want keep=False.
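
As a rough sketch of the keep=False variant, again on made-up data (the toy DataFrame and the names all_dupes, counts and repeated_ids are assumptions, not part of the original code), the following lists every row that belongs to a duplicate group and cross-checks the result with collections.Counter from the standard library:

```python
from collections import Counter

import pandas as pd

# Toy data standing in for the CSV; the values are invented for illustration.
df = pd.DataFrame({"Short_ID": ["a1", "b2", "a1", "c3", "b2", "a1"]})

# keep=False flags every member of a duplicate group, including the first
# occurrence, so all copies of a repeated ID can be inspected side by side.
all_dupes = df[df["Short_ID"].duplicated(keep=False)]
all_dupes = all_dupes.sort_values("Short_ID", kind="mergesort")
print(all_dupes)

# Cross-check without pandas: count how often each ID occurs
# and keep only the IDs that occur more than once.
counts = Counter(df["Short_ID"])
repeated_ids = {key: n for key, n in counts.items() if n > 1}
print(repeated_ids)
```

This also suggests why the loop in the question found nothing: red_l_DF0 is built from the set of l_DF0, so every item of l_DF0 is necessarily in it and l_prob stays empty. Counting occurrences, rather than testing membership, is what exposes the repeats.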
