My pandas DataFrame has the following structure:
Col1 | Col2 | Col3
-------+---------------------+--------------
0 6 | [a,b,c,d,e,f] | ....
1 4 | [a,g,h,i] | ....
2 5 | [a,b,j,k,l] | ....
I have a list of elements that must be removed from Col2: [a,b,h]
In the end I need to transform it into
Col1 | Col2 | Col3
-------+-----------------+--------------
0 4 | [c,d,e,f] | ....
1 2 | [g,i] | ....
2 3 | [j,k,l] | ....
Col1 is the number of elements in Col2.
I tried
import numpy as np

def modify_data(dataset):
    ds = dataset.copy()
    Col2 = dataset['Col2']
    remove_list = ['a', 'b', 'h']
    removed_col2 = []
    counts = []
    for i, row in enumerate(Col2):
        # set difference drops the unwanted elements (order is not preserved)
        cleaned = np.array(list(set(row) - set(remove_list)))
        removed_col2.append(cleaned)
        counts.append(len(cleaned))
    ds.loc[:, 'Col1'] = counts
    ds.loc[:, 'Col2'] = removed_col2
    return ds
But the performance is terrible. For example, for a dataset with 200,000 rows:
CPU times: user 11min 26s, sys: 24.2 s, total: 11min 50s
Wall time: 11min 48s
Answer 0 (score: 3)
I would try this:
df.Col2 = (df.Col2.map(set)-set(['a','b','h'])).map(list)
df.Col1 = df.Col2.str.len()
df
Out[111]:
Col2 Col1
0 [f, e, c, d] 4
1 [g, i] 2
2 [j, k, l] 3
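Note that a set difference does not keep the original order of the elements (see [f, e, c, d] in the output above) and it also collapses duplicates. If order matters, a minimal sketch along the following lines might help. It assumes the column names Col1 and Col2 from the question; the helper name remove_elements is hypothetical and only used for illustration.

```python
import pandas as pd

def remove_elements(df, to_remove):
    """Return a copy of df with the given values dropped from each Col2 list.

    Hypothetical helper: assumes every Col2 entry is a list and that
    Col1 should hold the length of the cleaned list.
    """
    drop = set(to_remove)  # set membership tests are O(1) on average
    out = df.copy()
    cleaned = [[v for v in row if v not in drop] for row in out["Col2"]]
    # Wrap in a Series so the lists line up with the original index.
    out["Col2"] = pd.Series(cleaned, index=out.index)
    out["Col1"] = out["Col2"].map(len)
    return out

df = pd.DataFrame(
    {
        "Col1": [6, 4, 5],
        "Col2": [list("abcdef"), list("aghi"), list("abjkl")],
    }
)
result = remove_elements(df, ["a", "b", "h"])
print(result)
```

The original frame is left untouched because the function works on a copy, and the remaining elements stay in their original order.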
Answer 1 (score: 1)
Another solution, using a list comprehension:
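import pandas as pd

df = pd.DataFrame(
    {
        "col1": [6, 4, 5],
        "col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
        ],
    }
)
# keep only the values that are not in the removal list
df['col2'] = [[value for value in entry
               if value not in ('a', 'b', 'h')]
              for entry in df.col2
              ]
# recompute the element count from the cleaned lists
df['col1'] = df.col2.str.len()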
col1 col2
0 4 [c, d, e, f]
1 2 [g, i]
2 3 [j, k, l]
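For comparison, the same result can also be reached by flattening the lists first. The sketch below is only an illustration, not a claim that it is faster: it assumes pandas 0.25 or later (for Series.explode), uses the lowercase column names from this answer, and adds a fourth row that becomes empty to show how that case is handled. Whether it beats the list comprehension depends on the data, so it is worth timing both.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
            ["a", "h"],  # extra row (assumption): everything gets removed
        ],
    }
)

# One row per list element; the original row label is repeated in the index.
exploded = df["col2"].explode()
kept = exploded[~exploded.isin(["a", "b", "h"])]

# Collect the surviving elements back into one list per original row.
cleaned = kept.groupby(level=0).agg(list)

# Rows that lost all their elements are missing after the groupby,
# so reindex and replace the resulting NaN with an empty list.
cleaned = cleaned.reindex(df.index)
df["col2"] = [x if isinstance(x, list) else [] for x in cleaned]
df["col1"] = df["col2"].map(len)
print(df)
```

The reindex step matters: without it, a row whose elements were all removed would silently disappear from the grouped result.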