I have a 2000-row CSV file that I am processing with pandas:
id raw_value manual_raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d 3/ 3\
000b08e3-4129-4fd2-8ec0-23d00fe38a45 ok ok
002882ca-48bb-4161-a75a-cf0ec984d650 ab%cd 100%
005ce267-674a-418c-b0f6-7835fdf02219 14:17 14:17
0070ae6a-944b-4549-a229-00301cc96e29 6456 14762
00827aad-f737-4ec6-9881-988982662ad8 HT HT
008796d7-b21e-4b91-854f-1d163e336c05 Avenue Avenue
009dfaa8-5343-4345-8619-3010a1f77a03 1740 1740
00ad9cc7-c048-4d82-aa90-727d6eede4ea Total Total
00c46967-ee13-40ac-a4b4-4c0cf4186e90 ST ST
01167f7e-01eb-4033-b62b-92674ba40182 LA LA
013254c9-4353-45dc-9955-7520474803b7 zébra zébra
01662fca-8d52-40a6-be17-59e5e51c4ac2 31,40 31,40
01666c4c-8b9e-4081-9b9c-5c75f9a1736d 143.23 143.23
0167ac66-fcd5-43da-95fa-c38107860a8d restitut-ion res_titution
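I would like to remove rows from this CSV (and store them in a new CSV file) as follows:
1. Remove the rows where raw_value or manual_raw_value contains one of these characters: { , ; : \ / . $ € % _ - }
2. Replace é and è with e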
Answer 0: (score: 1)
Use boolean indexing with a mask created by contains, joining the values with | (regex or), then replace, and finally apply lower:
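import re

# Characters that disqualify a row; re.escape takes care of the regex metacharacters
a = [',', ';', ':', '\\', '/', '.', '$', '€', '%', '_', '-']
joined = "|".join(re.escape(c) for c in a)
# True where either column contains one of the characters
bad = df['raw_value'].str.contains(joined) | df['manual_raw_value'].str.contains(joined)
# Keep the clean rows, replace the accented letters, then lowercase every column
df = (df[~bad].replace(['é', 'è'], 'e', regex=True)
              .apply(lambda x: x.str.lower())
              .reset_index(drop=True))
print(df)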
id raw_value manual_raw_value
0 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ok ok
1 0070ae6a-944b-4549-a229-00301cc96e29 6456 14762
2 00827aad-f737-4ec6-9881-988982662ad8 ht ht
3 008796d7-b21e-4b91-854f-1d163e336c05 avenue avenue
4 009dfaa8-5343-4345-8619-3010a1f77a03 1740 1740
5 00ad9cc7-c048-4d82-aa90-727d6eede4ea total total
6 00c46967-ee13-40ac-a4b4-4c0cf4186e90 st st
7 01167f7e-01eb-4033-b62b-92674ba40182 la la
8 013254c9-4353-45dc-9955-7520474803b7 zebra zebra
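The question also asks for the removed rows to be stored in a new CSV file, which the snippet above does not do. A minimal sketch of one way to add that, assuming the same two column names; the function name split_rows, the character class, the sample rows, and the output file names are illustrative placeholders, not part of the original answer:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Character class with the characters listed in the question (illustrative)
PATTERN = r"[,;:\\/.$€%_\-]"
COLS = ["raw_value", "manual_raw_value"]


def split_rows(df):
    """Return (kept, removed); removed rows have a listed character in either column."""
    # na=False treats missing cells as "no match"
    bad = df[COLS[0]].str.contains(PATTERN, na=False)
    for col in COLS[1:]:
        bad |= df[col].str.contains(PATTERN, na=False)
    removed = df[bad]
    kept = df[~bad].copy()
    # Clean only the two value columns: accents to e, then lower case
    for col in COLS:
        kept[col] = kept[col].str.replace("[éè]", "e", regex=True).str.lower()
    return kept.reset_index(drop=True), removed


# Made-up sample rows in the shape of the question's data
sample = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "raw_value": ["3/", "ok", "zébra", "HT"],
    "manual_raw_value": ["3\\", "ok", "zébra", "H_T"],
})
kept, removed = split_rows(sample)

# A temporary directory is used so the sketch runs anywhere; use your own path
out_dir = Path(tempfile.mkdtemp())
kept.to_csv(out_dir / "cleaned.csv", index=False)
removed.to_csv(out_dir / "removed_rows.csv", index=False)
```

Returning both frames from one mask keeps the two files consistent: every input row ends up in exactly one of them.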
Answer 1: (score: 0)
This can be done with applymap:
cols = ['raw_value', 'manual_raw_value']
specials = ['{', ',', ';', ':', '\\', '/', '.', '$', '€', '%', '_', '-', '}']
# Apply applymap to the two columns to check each cell for a special character;
# cells that contain one are replaced by NaN
# (pandas 2.1 and later deprecate applymap in favor of DataFrame.map)
df[cols] = df[cols][~df[cols].applymap(lambda x: any(ch in specials for ch in x))]
# Drop the rows that now hold a NaN (the ones that had a special character)
df.dropna(axis=0, how='any', inplace=True)
# Convert every value to lower case
df = df.applymap(lambda x: x.lower())
# Replace é and è with e
df = df.applymap(lambda x: x.replace('é', 'e').replace('è', 'e'))
print(df)
Result
id raw_value manual_raw_value
1 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ok ok
4 0070ae6a-944b-4549-a229-00301cc96e29 6456 14762
5 00827aad-f737-4ec6-9881-988982662ad8 ht ht
6 008796d7-b21e-4b91-854f-1d163e336c05 avenue avenue
7 009dfaa8-5343-4345-8619-3010a1f77a03 1740 1740
8 00ad9cc7-c048-4d82-aa90-727d6eede4ea total total
9 00c46967-ee13-40ac-a4b4-4c0cf4186e90 st st
10 01167f7e-01eb-4033-b62b-92674ba40182 la la
11 013254c9-4353-45dc-9955-7520474803b7 zebra zebra
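The per-cell check can also be written without applymap, which may help on newer pandas releases where applymap is deprecated. The following is only a sketch under that assumption: has_special and clean are hypothetical helper names, the sample rows are made up, and it builds a row mask directly instead of going through NaN.

```python
import pandas as pd

# The characters from the question, including the braces, as a set
SPECIALS = set("{,;:\\/.$€%_-}")
COLS = ["raw_value", "manual_raw_value"]


def has_special(value):
    """True if the cell, read as text, contains any of the listed characters."""
    return any(ch in SPECIALS for ch in str(value))


def clean(df):
    # Series.map runs has_special on each cell; any(axis=1) flags a row
    # when either of the two columns is flagged
    flagged = df[COLS].apply(lambda col: col.map(has_special)).any(axis=1)
    out = df[~flagged].copy()
    # Lowercase and strip the accents in the two value columns only
    for col in COLS:
        out[col] = out[col].map(
            lambda v: str(v).lower().replace("é", "e").replace("è", "e")
        )
    return out


sample = pd.DataFrame({
    "id": ["b1", "b2", "b3"],
    "raw_value": ["31,40", "Avenue", "zèbra"],
    "manual_raw_value": ["31,40", "Avenue", "zébra"],
})
result = clean(sample)
print(result)
```

Note that str(value) turns every cell into text, so numeric or missing cells would come out as strings; that is acceptable here because both columns hold free-form text.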