Suppose I have a pandas DataFrame df:
col_name
1 [16, 4, 30]
2 [5, 1, 2]
3 [4, 5, 52, 888]
4 [1, 2, 4]
5 [5, 99, 4, 75, 1, 2]
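For anyone who wants to reproduce this, a frame like the one above could be built roughly as follows. The 1-based index is an assumption read off the printout; the original post does not show how df was created.

```python
import pandas as pd

# Rebuild the example frame; the 1..5 index is assumed from the printed output.
df = pd.DataFrame(
    {"col_name": [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]},
    index=range(1, 6),
)
print(df)
```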
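I want to remove every element that appears fewer than x times across the whole column. For example, take x = 3.
That means I would like the result to look like this: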
col_name
1 [4]
2 [5, 1, 2]
3 [4, 5]
4 [1, 2, 4]
5 [5, 4, 1, 2]
The resulting df essentially drops the numbers 16, 30, 52, 888, 99 and 75, because each of them appears fewer than 3 times in the column.
I tried using Counter from collections, but it did not work.
I would really appreciate any hints. Thanks in advance.
Answer 0 (score: 3)
Option 1
A fairly simple, vanilla approach
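import numpy as np
import pandas as pd
# Flatten into a Series keyed by (row label, position in the list), then keep values seen at least 3 times
s = pd.Series({(i, j): x for (i, r) in df.col_name.items() for j, x in enumerate(r)})
f, u = pd.factorize(s.values)
s[(np.bincount(f) >= 3)[f]].groupby(level=0).apply(list).to_frame('col_name')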
col_name
0 [4]
1 [5, 1, 2]
2 [4, 5]
3 [1, 2, 4]
4 [5, 4, 1, 2]
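The factorize/bincount step in Option 1 may be easier to follow on a small flat array. This is only a sketch: the sample values and the threshold of 2 are made up for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

values = np.array([16, 4, 30, 5, 1, 2, 4, 5])   # hypothetical flat sample

codes, uniques = pd.factorize(values)   # codes[i] is the position of values[i] in uniques
counts = np.bincount(codes)             # counts[k] is how often uniques[k] occurs
keep = (counts >= 2)[codes]             # map the per-unique test back onto every element
kept = values[keep]

print(uniques.tolist())  # distinct values in order of first appearance
print(kept.tolist())     # the elements that occur at least twice
```

Indexing the boolean array with the codes is what turns a test per distinct value into a mask per element.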
Option 2
An over-the-top, complicated approach
lens = df.col_name.str.len().values
splits = lens.cumsum()[:-1]
values = np.concatenate(df.col_name.values)
f, u = pd.factorize(values)
b = np.bincount(f)
r = np.arange(len(df)).repeat(lens)
m = (b >= 3)[f]
new_splits = splits - np.bincount(r, ~m).astype(int).cumsum()[:-1]
new_values = np.split(values[m], new_splits)
df.assign(col_name=new_values)
col_name
0 [4]
1 [5, 1, 2]
2 [4, 5]
3 [1, 2, 4]
4 [5, 4, 1, 2]
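The split-point arithmetic in Option 2 subtracts the running number of removed elements from the original boundaries. An equivalent way to look at it, sketched below with plain lists instead of df, is to accumulate the number of surviving elements per row directly. The names rows, row_id and kept_per_row are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

rows = [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]
lens = np.array([len(r) for r in rows])
values = np.concatenate(rows)
row_id = np.arange(len(rows)).repeat(lens)    # which row each flat element came from

codes, _ = pd.factorize(values)
mask = (np.bincount(codes) >= 3)[codes]       # True where the value is frequent enough

# Survivors per row, accumulated, give the boundaries for np.split.
kept_per_row = np.bincount(row_id, weights=mask, minlength=len(rows)).astype(int)
new_splits = kept_per_row.cumsum()[:-1]
result = [part.tolist() for part in np.split(values[mask], new_splits)]
print(result)
```

Both formulations should give the same boundaries, since kept elements per row equal the original length minus the removed ones.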
Answer 1 (score: 2)
First get the counts, then use apply or applymap to check the condition for each element.
In [2707]: counts = pd.Series([v for x in df.col_name for v in x]).value_counts()
In [2708]: df.col_name.apply(lambda x: [v for v in x if counts[v] >= 3])
Out[2708]:
1 [4]
2 [5, 1, 2]
3 [4, 5]
4 [1, 2, 4]
5 [5, 4, 1, 2]
Name: col_name, dtype: object
In [2709]: df.applymap(lambda x: [v for v in x if counts[v] >= 3])
Out[2709]:
col_name
1 [4]
2 [5, 1, 2]
3 [4, 5]
4 [1, 2, 4]
5 [5, 4, 1, 2]
Details
In [2710]: counts
Out[2710]:
4 4
5 3
2 3
1 3
30 1
888 1
52 1
16 1
75 1
99 1
dtype: int64
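Since the question talks about a general threshold x, the same idea can be wrapped in a small function. This is a sketch, and drop_rare and min_count are hypothetical names, not something from the answer. It sticks to Series.apply, because recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favour of DataFrame.map.

```python
import pandas as pd

def drop_rare(series, min_count):
    """Hypothetical helper: keep only values seen at least min_count times column-wide."""
    # Count every value across all lists in the column.
    counts = pd.Series([v for row in series for v in row]).value_counts()
    # Filter each list, preserving the original order of its elements.
    return series.apply(lambda row: [v for v in row if counts[v] >= min_count])

df = pd.DataFrame({"col_name": [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]})
df["col_name"] = drop_rare(df["col_name"], min_count=3)
print(df)
```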
Answer 2 (score: 2)
You can use Counter() from collections:
import pandas as pd
from collections import Counter
limit = 3
df = pd.DataFrame({'col_name': [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]})
flat = Counter([y for x in df.col_name for y in x])
desired = [k for k, v in flat.items() if v >= limit]
df['col_name'] = df['col_name'].apply(lambda x: [i for i in x if i in desired])
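One possible tweak, sketched here on plain lists rather than on df: if the column holds many distinct values, storing the allowed values in a set instead of a list makes each membership test constant time on average. The variable rows is introduced just for this illustration.

```python
from collections import Counter
from itertools import chain

rows = [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]
limit = 3

flat = Counter(chain.from_iterable(rows))                # value -> count across all rows
desired = {k for k, v in flat.items() if v >= limit}     # set, so `in` does not scan a list
filtered = [[i for i in row if i in desired] for row in rows]
print(filtered)
```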
Answer 3 (score: 2)
You can use value_counts with boolean indexing:
import pandas as pd
from itertools import chain
a = pd.Series(list(chain.from_iterable(df['col_name']))).value_counts()
a = a.index[a >= 3]
print (a)
Int64Index([4, 5, 2, 1], dtype='int64')
df = df.applymap(lambda x: [v for v in x if v in a])
print (df)
col_name
1 [4]
2 [5, 1, 2]
3 [4, 5]
4 [1, 2, 4]
5 [5, 4, 1, 2]
Answer 4 (score: 2)
Similar to this answer, using collections.Counter (though developed independently, with a few optimizations):
import numpy as np
import pandas as pd
from collections import Counter
# Count every value across all lists in the column
c = Counter(pd.Series(np.concatenate(df.col_name.tolist())))
def foo(array):
    return [x for x in array if c[x] >= 3]
df.col_name = df.col_name.apply(foo)
df
col_name
1 [4]
2 [5, 1, 2]
3 [4, 5]
4 [1, 2, 4]
5 [5, 4, 1, 2]