I have this dataset:
import re
import numpy as np
import pandas as pd
frame = pd.DataFrame({'col_a': [np.nan, 'in millions', 'millions', 'in thousands', 'thousands', np.nan, 'thousands', 'abcdef'],
                      'col_b': ['2009', '2009', '2009', '2009', '2009', '2009', 'abc', '2009'],
                      'col_c': ['2010', '2010', '2010', '2010', '2010', '2010', 'def', '2010'],
                      'col_d': [np.nan, np.nan, np.nan, np.nan, np.nan, 'thousands', np.nan, np.nan]})
which produces:
          col_a  col_b  col_c      col_d
0           NaN   2009   2010        NaN
1   in millions   2009   2010        NaN
2      millions   2009   2010        NaN
3  in thousands   2009   2010        NaN
4     thousands   2009   2010        NaN
5           NaN   2009   2010  thousands
6     thousands    abc    def        NaN
7        abcdef   2009   2010        NaN
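I want to filter this dataframe for every row that contains \s*?\d{4}\s*? and whose remaining cells all either match millions|thousands or are NaN, but are not any string other than "millions" or "thousands". That is, I want rows 0, 1, 2, 3, 4 and 5.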
I do this:
# Step 1: keep rows in which any cell contains a four-digit number
mask = frame.astype(str).apply(lambda x: x.str.contains(r'\s*?\d{4}\s*?',
                                                        regex=True,
                                                        flags=re.IGNORECASE,
                                                        na=False)).any(axis=1)
test = frame[mask]
# Step 2: of those, keep rows in which any cell mentions millions or thousands
mask = test.astype(str).apply(lambda x: x.str.contains(r'(in)?millions|thousands',
                                                       regex=True,
                                                       flags=re.IGNORECASE,
                                                       na=False)).any(axis=1)
test = test[mask]
test
which gives:
          col_a  col_b  col_c      col_d
1   in millions   2009   2010        NaN
2      millions   2009   2010        NaN
3  in thousands   2009   2010        NaN
4     thousands   2009   2010        NaN
5           NaN   2009   2010  thousands
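Row 0 is missing from the filtered dataframe because it has NaN in both col_a and col_d. Python also emits a warning:
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
# Remove the CWD from sys.path while we load stuff.
How can I change this so that row 0 is included as well? If I change the second regex to (in)?millions|thousands|NaN, I also get rows 6 and 7, which is not what I want.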
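A likely reason the NaN alternative lets unwanted rows through, sketched with plain re instead of pandas: once every cell has been converted to text, a missing value is just the string 'nan', and with IGNORECASE the NaN alternative matches it. The sketch assumes that astype(str) renders missing values as 'nan' (the same text str() gives for a float NaN), and the variable names are made up for illustration.

```python
import re

# Assumption: after astype(str) a missing value becomes the text 'nan',
# the same text that str() produces for a float NaN.
missing_as_text = str(float('nan'))

# The cells of row 7 once every cell has been converted to text.
row_7 = ['abcdef', '2009', '2010', missing_as_text]

pattern = re.compile(r'(in)?millions|thousands|NaN', flags=re.IGNORECASE)

# Collect the cells of row 7 that the extended pattern matches.
row_7_hits = [cell for cell in row_7 if pattern.search(cell)]

print(row_7_hits)
```

The only matching cell is the converted missing value in col_d, so the "any cell matches" test accepts the row even though col_a holds 'abcdef'.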
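Edit: In this dataset I know that col_a and col_d contain NaN in row 0. In the real dataset I do not know in which columns NaN occurs. That is, the more general filter condition is: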
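Answer 0 (score: 0)
In case anyone has a similar problem: I found that a negative lookahead works for me:
# True for cells that contain neither "thousand" nor "million" and do not start with four digits
mask = frame.apply(lambda x: x.str.contains(r'^(?!.*?thousand|.*?million|\d{4}).*?$',
                                            regex=True,
                                            flags=re.IGNORECASE)).any(axis=1)
# Keep only the rows that have no such cell
test = frame[~mask]
test
which produces: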
          col_a  col_b  col_c      col_d
0           NaN   2009   2010        NaN
1   in millions   2009   2010        NaN
2      millions   2009   2010        NaN
3  in thousands   2009   2010        NaN
4     thousands   2009   2010        NaN
5           NaN   2009   2010  thousands
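If the condition needs to be spelled out cell by cell, the same filter could also be written without a lookahead. This is only a sketch under a few assumptions: the helper names is_year and is_unit are made up for illustration, a "year" cell is taken to be exactly four digits with optional surrounding whitespace, and a row is required to contain at least one such cell (something the lookahead above does not check). It does not depend on which columns hold the NaN values.

```python
import re

import numpy as np
import pandas as pd

frame = pd.DataFrame({'col_a': [np.nan, 'in millions', 'millions', 'in thousands',
                                'thousands', np.nan, 'thousands', 'abcdef'],
                      'col_b': ['2009', '2009', '2009', '2009', '2009', '2009', 'abc', '2009'],
                      'col_c': ['2010', '2010', '2010', '2010', '2010', '2010', 'def', '2010'],
                      'col_d': [np.nan, np.nan, np.nan, np.nan, np.nan, 'thousands',
                                np.nan, np.nan]})

YEAR = re.compile(r'\s*\d{4}\s*')
UNIT = re.compile(r'millions|thousands', flags=re.IGNORECASE)


def is_year(cell):
    # Hypothetical helper: True when the cell is a string holding only four digits.
    return isinstance(cell, str) and YEAR.fullmatch(cell) is not None


def is_unit(cell):
    # Hypothetical helper: True when the cell is a string mentioning millions or thousands.
    return isinstance(cell, str) and UNIT.search(cell) is not None


# Classify every cell; missing values are neither years nor units.
years = frame.apply(lambda col: col.map(is_year)).astype(bool)
units = frame.apply(lambda col: col.map(is_unit)).astype(bool)
missing = frame.isna()

# Keep a row when every cell is a year, a unit or missing, and at least one cell is a year.
keep = (years | units | missing).all(axis=1) & years.any(axis=1)
result = frame[keep]

print(result.index.tolist())
```

Because keep is built from frame itself, it has the same index as frame, so the boolean indexing step does not need to reindex the mask.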