Question

Suppose I have a list of words such as

c= ('an', 'abc', 'pls')

and a column in a dataframe,

df['column']

another
fan
Ind
abcd
point
plsea

I want to check whether any value in c occurs in df['column'], and if it does, set that entry to NA. The output I need is:

NA
NA
Ind
NA
point
NA

This is what I tried:

c in df['column']
False

It seems to check only the first row, and I cannot get this to work. Can someone help me do this?
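For context on the False result above: the `in` operator on a pandas Series tests the index labels rather than the stored values, so the tuple c is never found. A minimal sketch, assuming the column has the default integer index (the dataframe is rebuilt here only for illustration):

```python
import pandas as pd

c = ('an', 'abc', 'pls')
df = pd.DataFrame({'column': ['another', 'fan', 'Ind', 'abcd', 'point', 'plsea']})

# `in` on a Series looks at the index labels (0..5 here), not the strings.
print(0 in df['column'])              # True: 0 is an index label
print('fan' in df['column'])          # False: 'fan' is a value, not a label
print(c in df['column'])              # False: the tuple is not a label either

# Membership among the values needs .values (and is still an exact match,
# not a substring test).
print('fan' in df['column'].values)   # True
```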

Answer 1

If you join all the words in c with '|' to build a regular-expression pattern, you can pass it to str.contains and set every matching row to 'NA':

In [21]:
df.loc[df['words'].str.contains('|'.join(c)),'words'] = 'NA'
df

Out[21]:
   words
0     NA
1     NA
2    Ind
3     NA
4  point
5     NA

Here is the output of the intermediate steps:

In [23]:
'|'.join(c)

Out[23]:
'an|abc|pls'

In [24]:
df['words'].str.contains('|'.join(c))

Out[24]:
0     True
1     True
2    False
3     True
4    False
5     True
Name: words, dtype: bool
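If the words could contain regex metacharacters, or the column could hold missing values, a slightly more defensive variant may help. This is only a sketch: the extra None row is an assumption added to show the effect of na=False, and re.escape is used so each word is matched literally.

```python
import re
import pandas as pd

c = ('an', 'abc', 'pls')
df = pd.DataFrame({'words': ['another', 'fan', 'Ind', 'abcd', 'point', 'plsea', None]})

# Escape each word so characters such as '.' or '+' are matched literally.
pattern = '|'.join(re.escape(word) for word in c)

# na=False treats missing entries as "no match", so the mask is purely boolean.
mask = df['words'].str.contains(pattern, na=False)
df.loc[mask, 'words'] = 'NA'
print(df)
```

For the three plain words in the question the escaped pattern is the same 'an|abc|pls' as above, so the matches do not change.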

Answer 2

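There may be a dedicated pandas method for this, but in plain Python you would iterate over each value in the column and check whether any word in c occurs in it: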

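for idx, value in df['column'].items():
    # any() with a generator stops at the first word found in the value
    if any(word in value for word in c):
        # .loc avoids chained assignment, which may not write back to df
        df.loc[idx, 'column'] = 'NA'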

Answer 3

Use apply() with a lambda expression, for example:

df['column'].apply(lambda x: 'NA' if any(s in x for s in c) else x)

If you want to update the original dataframe:

import pandas as pd

c = ('an', 'abc', 'pls')
df = pd.DataFrame([[1,2,'another'],[3,4,'fan'],[5,6,'Ind'],[0,0,'abcd'],[1,2,'point'],[22,44,'plsea']])
df.columns = ['A', 'B', 'C']

>>> df['C'].apply(lambda x: 'NA' if any(s in x for s in c) else x)
0       NA
1       NA
2      Ind
3       NA
4    point
5       NA
Name: C, dtype: object

Assigning the result back to the column, i.e. df['C'] = df['C'].apply(...), will do this.

Answer 4

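You can use the dataframe replace function with a regular expression and numpy.nan. With regex=True, a partial match is enough for the value to be replaced by NaN. You can then use fillna to fill the replaced values with 'NA':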

df['column'] = df['column'].replace('|'.join(c), np.nan, regex=True).fillna('NA')

Here is the sample dataframe I created:

import numpy as np
import pandas as pd
c = ('an', 'abc', 'pls')
data = ['another', 'fan', 'Ind', 'abcd', 'point', 'plsea']
df = pd.DataFrame(data)
df.columns = ['column']

Here is the output for df:

0       NA
1       NA
2      Ind
3       NA
4    point
5       NA
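The same two-step idea (mark the matches as missing, then fill them) can also be written with str.contains and Series.mask, which makes the intermediate missing values explicit. This is a sketch of an alternative, not the method used in this answer; the names hits and masked are illustrative.

```python
import pandas as pd

c = ('an', 'abc', 'pls')
df = pd.DataFrame({'column': ['another', 'fan', 'Ind', 'abcd', 'point', 'plsea']})

# Boolean Series: True where any word in c occurs in the value.
hits = df['column'].str.contains('|'.join(c))

# mask() replaces entries where the condition is True with a missing value.
masked = df['column'].mask(hits)
print(masked.isna().tolist())  # [True, True, False, True, False, True]

# Fill the missing entries with the literal string 'NA'.
df['column'] = masked.fillna('NA')
print(df)
```

Stopping before fillna leaves real missing values in the column, which may be preferable to the string 'NA' if the data is processed further with isna or dropna.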

Answer 5

I think you can use this pattern:

c= ('an', 'abc', 'pls')

df = ('another', 'fan', 'Ind', 'abcd', 'point', 'plsea')

for x in df:
    for y in c:
        print(y in x)
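The loop above only prints the result of each membership test. A minimal sketch of how the same test could produce the requested output, assuming the values are held in a plain tuple (named words here rather than df, to avoid confusion with a dataframe):

```python
c = ('an', 'abc', 'pls')
words = ('another', 'fan', 'Ind', 'abcd', 'point', 'plsea')

# Keep a value unless at least one word in c occurs inside it.
result = ['NA' if any(y in x for y in c) else x for x in words]
print(result)  # ['NA', 'NA', 'Ind', 'NA', 'point', 'NA']
```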

Set a value to NA in Python if a string is present in the word
