Question

I have a DataFrame with 2 columns, one of which holds a comma-separated list of values:

A  1,2,3,4,6

B  1,5,6,7

C  1,3,2,8,9,7

D  1,3,6,8

I also have an array: [2,3,9]

In the end, I want to transform the same DataFrame so that any values not in the array are filtered out. For example:

A  2,3

B  

C  3,2,9

D  3

Can anyone point me in the right direction? I've looked around but have hit a bit of a wall.

Answer 1

Setup

import re
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
})

good = [str(i) for i in [2,3,9]]

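We can use a regular expression with re.findall to extract all acceptable values. We only need to assert that a match is not directly preceded or followed by a digit, so that we never match a digit in the middle of another number: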

rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))  # raw string so \d is not treated as a string escape

df.assign(out=[','.join(re.findall(rgx, row)) for row in df.col2])

  col1         col2    out
0    A    1,2,3,4,6    2,3
1    B      1,5,6,7
2    C  1,3,2,8,9,7  3,2,9
3    D      1,3,6,8      3

Regex explanation

(?<!                 # Negative lookbehind
  \d                 # Asserts previous character is *not* a digit
)           
(                    # Matching group
  2|3|9              # Matches either 2 or 3 or 9
) 
(?!                  # Negative lookahead
  \d                 # Asserts the following character is *not* a digit
)
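
To see why the lookarounds matter, here is a minimal sketch that compares the pattern with and without them. The sample string is made up for illustration (it is not from the question) and was chosen because it contains multi-digit values.

```python
import re

good = ['2', '3', '9']
alternation = '|'.join(good)

# Pattern without lookarounds: also matches digits inside larger numbers.
naive = re.compile('({})'.format(alternation))
# Pattern from the answer: a match may not touch another digit on either side.
guarded = re.compile(r'(?<!\d)({})(?!\d)'.format(alternation))

sample = '12,23,2,39,9'  # hypothetical input with multi-digit values

naive_matches = naive.findall(sample)
guarded_matches = guarded.findall(sample)

print(naive_matches)    # ['2', '2', '3', '2', '3', '9', '9']
print(guarded_matches)  # ['2', '9']
```

With single-digit data like the question's, both patterns give the same result. The lookarounds only start to matter once values such as 12 or 39 appear.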

Answer 2

I think the apply method gives an easy-to-understand solution.

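# df is assumed to have integer column labels 0 and 1, as in the output shown
allowed = [2, 3, 9]
allowed_string = [str(x) for x in allowed]  # the split values are strings
df[1] = df[1].str.split(',')
df[1] = df[1].apply(lambda x: [y for y in x if y in allowed_string])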

Output:

   0          1
0  A     [2, 3]
1  B         []
2  C  [3, 2, 9]
3  D        [3]
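
The result above holds lists, whereas the question asks for comma-separated strings. A minimal sketch of one way to get that shape is to join the kept values inside the applied function. The helper name keep_allowed is hypothetical, and the DataFrame is rebuilt here with integer column labels so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    0: ['A', 'B', 'C', 'D'],
    1: ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8'],
})

allowed = {str(x) for x in [2, 3, 9]}  # a set of strings for membership checks


def keep_allowed(cell):
    """Return the comma-separated values of `cell` that are in `allowed`."""
    return ','.join(v for v in cell.split(',') if v in allowed)


df[1] = df[1].apply(keep_allowed)
result = df[1].tolist()
print(df)
```

Rows with no allowed values end up as empty strings, which matches the blank entry for B in the question's expected output.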

Answer 3

You can use the isin() method to filter column values. See the example below.

import pandas as pd

data = {'A':[1,2,3,4,6],
        'B':[1,5,6,7],
        'C':[1,3,2,8,9,7],
        'D':[1,3,6,8]}

allow_list = [2,3,9]    #list of allowed elements

df = pd.concat([pd.Series(val, name=key) for key, val in data.items()], axis=1)

df1 = df[df.isin(allow_list)]    # keep allowed elements; everything else becomes NaN
df1 = df1.dropna(how='all')    # remove rows which are all NaN
print(df1)

Output:

     A    C   B    D
1  2.0  3.0 NaN  3.0
2  3.0  2.0 NaN  NaN
4  NaN  9.0 NaN  NaN
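
The example above reshapes the data so that each label becomes a column. As a rough sketch, isin() can also be applied to the question's original two-column layout by exploding the split values into one row each. This assumes pandas 0.25 or later for Series.explode, and borrows the column names col1 and col2 from Answer 1.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8'],
})
allow_list = ['2', '3', '9']  # strings, because the split values are strings

# One row per individual value; the original row index is repeated.
exploded = df['col2'].str.split(',').explode()

# Keep allowed values, then rebuild one comma-separated string per original row.
kept = exploded[exploded.isin(allow_list)]
joined = kept.groupby(level=0).agg(','.join)

# Rows with no allowed values are absent from `joined`, so restore them as ''.
df['out'] = joined.reindex(df.index, fill_value='')
out = df['out'].tolist()
print(df)
```

The reindex step is what keeps row B in the result, since a row with no allowed values forms no group.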

Remove specific values in a DataFrame based on whether they are in an array
