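I have a DataFrame from which I want to drop rows with duplicate IDs. Most IDs are integers or strings, but some ID entries are lists of several IDs. I cannot split these lists apart, and I get an error when I try to drop the duplicates. For reference, I tried df = df['ID'].astype(str), which had no effect on the errors shown below.
Code for df: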
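import pandas as pd

# Alphanumeric IDs must be quoted strings; bare names such as F41 raise NameError.
d = {'ID': [999,
            123,
            'F41',
            '99W21',
            662,
            123,
            [552, 'F430', 'R111'],
            44482,
            'F41',
            ['M192', 5527, 7890, 111120]
            ]}
df = pd.DataFrame(data=d)  # the constructor is DataFrame, not Dataframe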
The ID column of the input df looks like this:
Index ID
-------------
0 999
1 123
2 F41
3 99W21
4 662
5 123
6 [552, F430, R111]
7 44482
8 F41
9 [M192, 5527, 7890, 111120]
I want to drop the duplicates so that the output is:
Index ID
-------------
0 999
1 123
2 F41
3 99W21
4 662
5 [552, F430, R111]
6 44482
7 [M192, 5527, 7890, 111120]
I tried df.drop_duplicates(subset=['ID'], inplace=True), but got this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-0186aa1e1043> in <module>
3 # Reset index and drop CID duplicates
----> 4 df.drop_duplicates(subset=['ID'], inplace=True)
5 df.reset_index(drop=True, inplace=True)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
4907
4908 inplace = validate_bool_kwarg(inplace, "inplace")
-> 4909 duplicated = self.duplicated(subset, keep=keep)
4910
4911 if inplace:
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in duplicated(self, subset, keep)
4967
4968 vals = (col.values for name, col in self.items() if name in subset)
-> 4969 labels, shape = map(list, zip(*map(f, vals)))
4970
4971 ids = get_group_index(labels, shape, sort=False, xnull=False)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in f(vals)
4945 def f(vals):
4946 labels, shape = algorithms.factorize(
-> 4947 vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)
4948 )
4949 return labels.astype("i8", copy=False), len(shape)
/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
206 else:
207 kwargs[new_arg_name] = new_arg_value
--> 208 return func(*args, **kwargs)
209
210 return wrapper
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
670
671 labels, uniques = _factorize_array(
--> 672 values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
673 )
674
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
506 table = hash_klass(size_hint or len(values))
507 uniques, labels = table.factorize(
--> 508 values, na_sentinel=na_sentinel, na_value=na_value
509 )
510
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'list'
I also tried df = pd.DataFrame(np.unique(df), columns=df.columns), which gave this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-5b335a526fd5> in <module>
3 # Reset index and drop CID duplicates
----> 4 df = pd.DataFrame(np.unique(df), columns=df.columns)
5 df.reset_index(drop=True, inplace=True)
<__array_function__ internals> in unique(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'float' and 'str'
If there is a solution, I am not sure what it is, so any help would be appreciated.
Answer 0 (score: 0)
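The unhashable type: 'list' error means that pandas tried to hash a list. drop_duplicates compares values through a hash table (PyObjectHashTable in the traceback), and Python lists cannot be hashed.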
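To see the same failure outside pandas, here is a minimal sketch using one of the list-valued IDs from the question. The variable names are only illustrative. It shows that hashing a list raises the TypeError from the traceback, while a tuple with the same contents hashes without trouble.

```python
# One of the list-valued IDs from the question.
ids = [552, "F430", "R111"]

try:
    hash(ids)
    list_hashable = True
except TypeError as exc:
    list_hashable = False
    print(exc)  # unhashable type: 'list'

# A tuple holding the same (hashable) elements can be hashed.
tuple_hashable = isinstance(hash(tuple(ids)), int)
print(list_hashable, tuple_hashable)
```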
Try converting the column to strings, dropping the duplicates, and then converting the result back to a DataFrame:
df = df['ID'].astype(str).drop_duplicates().to_frame()
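Note that the one-liner above leaves every ID as a string, including the lists. If the original integers and lists should survive, one possible variation (a sketch, not part of the original answer) is to use the string form only as a comparison key and filter the original frame with it. The names key and deduped are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"ID": [999, 123, "F41", "99W21", 662, 123,
                          [552, "F430", "R111"], 44482, "F41",
                          ["M192", 5527, 7890, 111120]]})

# String version of each ID, used only to detect duplicates.
key = df["ID"].astype(str)

# Keep the first occurrence of each key; the ID values themselves are not converted.
deduped = df[~key.duplicated()].reset_index(drop=True)
print(deduped)
```

Because the comparison runs on strings, the integer 123 and the string '123' would count as duplicates of each other. That does not matter for the sample data, but it is worth checking against real IDs.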