I have a DataFrame with 2 columns, one of which holds a comma-separated list of values:
A 1,2,3,4,6
B 1,5,6,7
C 1,3,2,8,9,7
D 1,3,6,8
I also have an array: [2,3,9]
In the end, I want to transform the same DataFrame so that any values not in the array are filtered out. For example:
A 2,3
B
C 3,2,9
D 3
Can anyone point me in the right direction? I've looked around but hit a bit of a wall.
Answer 0 (score: 0)
Setup
import re
import pandas as pd
df = pd.DataFrame({
'col1': ['A', 'B', 'C', 'D'],
'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
})
good = [str(i) for i in [2,3,9]]
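We can use a regular expression with re.findall to extract all of the acceptable values.
We only need to assert that a match is not directly preceded or followed by a digit, so that we don't match a digit in the middle of another number:
rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))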
df.assign(out=[','.join(re.findall(rgx, row)) for row in df.col2])
col1 col2 out
0 A 1,2,3,4,6 2,3
1 B 1,5,6,7
2 C 1,3,2,8,9,7 3,2,9
3 D 1,3,6,8 3
Regex explanation
(?<! # Negative lookbehind
\d # Asserts previous character is *not* a digit
)
( # Matching group
2|3|9 # Matches either 2 or 3 or 9
)
(?! # Negative lookahead
\d # Asserts the following character is *not* a digit
)
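To see why the lookarounds matter, here is a minimal sketch that compares a plain alternation against the guarded pattern. The sample string `row` is hypothetical (the question's data has only single-digit values), and `re.escape` is an extra precaution that the original answer does not use.

```python
import re

good = ['2', '3', '9']

# Same pattern as above; re.escape is an added precaution, not part of the answer.
rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(map(re.escape, good)))

# Hypothetical row containing multi-digit numbers.
row = '12,2,23,39,9'

# Without lookarounds, digits inside larger numbers are matched as well.
naive = re.findall('|'.join(good), row)

# With lookarounds, only standalone values are matched.
guarded = re.findall(rgx, row)

print(naive)    # ['2', '2', '2', '3', '3', '9', '9']
print(guarded)  # ['2', '9']
```

The guarded pattern skips the 2 in 12, both digits of 23, and both digits of 39, because each of those has a digit next to it.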
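Answer 1 (score: 0)
I think the apply method gives an easy-to-understand solution. The code below assumes the DataFrame's columns are labeled 0 and 1.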
allowed = [2, 3, 9]
allowed_string = [str(x) for x in allowed]
df[1] = df[1].str.split(',')
df[1] = df[1].apply(lambda x: [y for y in x if y in allowed_string])
Output:
0 1
0 A [2, 3]
1 B []
2 C [3, 2, 9]
3 D [3]
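The result above holds lists, while the question asks for comma-separated strings. One possible variation, sketched here with the question's data rebuilt under the default column labels 0 and 1 (an assumption), joins the kept values back together inside the same apply call:

```python
import pandas as pd

# Rebuild the question's data; the column labels 0 and 1 are assumed.
df = pd.DataFrame([
    ['A', '1,2,3,4,6'],
    ['B', '1,5,6,7'],
    ['C', '1,3,2,8,9,7'],
    ['D', '1,3,6,8'],
])

allowed = [2, 3, 9]
allowed_string = {str(x) for x in allowed}  # a set gives fast membership tests

# Split each string, keep only allowed values, and join them back with commas.
df[1] = df[1].apply(
    lambda s: ','.join(v for v in s.split(',') if v in allowed_string)
)

print(df)
```

Rows with no allowed values end up as empty strings, which matches the blank entry for B in the question.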
Answer 2 (score: 0)
You can use the isin() method to filter column values. See the example below.
import pandas as pd
data = {'A':[1,2,3,4,6],
'B':[1,5,6,7],
'C':[1,3,2,8,9,7],
'D':[1,3,6,8]}
allow_list = [2,3,9] #list of allowed elements
df = pd.concat([pd.Series(val, name=key) for key, val in data.items()], axis=1)
df1 = df[df.isin(allow_list)] # keep allowed elements; everything else becomes NaN
df1 = df1.dropna(how='all') # remove rows which are all NaN
print(df1)
Output:
A C B D
1 2.0 3.0 NaN 3.0
2 3.0 2.0 NaN NaN
4 NaN 9.0 NaN NaN
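The example above reshapes the data into one column per label. If the data stays in the question's original two-column layout, isin() can still be applied after splitting and exploding the strings. This is only a sketch: the column names col1, col2 and out are assumptions, and Series.explode needs pandas 0.25 or later.

```python
import pandas as pd

# The question's data in its original layout; column names are assumed.
df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8'],
})
allow_list = ['2', '3', '9']  # strings, because the split values are strings

# One row per value; the original row index is repeated for each value.
exploded = df['col2'].str.split(',').explode()

# Keep only the allowed values.
kept = exploded[exploded.isin(allow_list)]

# Re-join per original row; rows with nothing left get an empty string.
out = kept.groupby(level=0).agg(','.join)
df['out'] = out.reindex(df.index, fill_value='')

print(df)
```

The order of values within each row is preserved, so C comes out as 3,2,9 as in the question's expected result.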