Delete rows and replace characters in pandas

Time: 2017-05-18 15:57:09

Tags: python pandas dataframe

I have a CSV file with 2000 rows that I process with pandas:

id                                          raw_value   manual_raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d        3/          3\
000b08e3-4129-4fd2-8ec0-23d00fe38a45        ok          ok
002882ca-48bb-4161-a75a-cf0ec984d650        ab%cd       100%
005ce267-674a-418c-b0f6-7835fdf02219        14:17       14:17
0070ae6a-944b-4549-a229-00301cc96e29        6456        14762
00827aad-f737-4ec6-9881-988982662ad8        HT          HT
008796d7-b21e-4b91-854f-1d163e336c05        Avenue      Avenue
009dfaa8-5343-4345-8619-3010a1f77a03        1740        1740
00ad9cc7-c048-4d82-aa90-727d6eede4ea        Total       Total
00c46967-ee13-40ac-a4b4-4c0cf4186e90        ST          ST
01167f7e-01eb-4033-b62b-92674ba40182        LA          LA
013254c9-4353-45dc-9955-7520474803b7        zébra       zébra
01662fca-8d52-40a6-be17-59e5e51c4ac2        31,40       31,40
01666c4c-8b9e-4081-9b9c-5c75f9a1736d        143.23      143.23
0167ac66-fcd5-43da-95fa-c38107860a8d        restitut-ion    res_titution

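I want to delete rows from this CSV (and store them in a new CSV file) as follows:

  1. Delete every row in which raw_value and manual_raw_value contain any of these characters: { , ; : \ / . $ € % _ -}
  2. Convert all letters to lowercase
  3. Replace é and è with e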

2 answers:

Answer 0: (score: 1)

Use boolean indexing with a mask built by str.contains, joining the characters with | (regex or); then apply replace and finally str.lower:

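# df is assumed to hold the CSV already loaded into a DataFrame
# characters to look for; regex metacharacters are escaped in raw strings
a = [',', ';', ':', r'\\', '/', r'\.', r'\$', '€', '%', '_', '-']
joined = "|".join(a)

# keep a row when at least one of the two columns is free of those characters
mask = (~df['raw_value'].str.contains(joined) |
        ~df['manual_raw_value'].str.contains(joined))
df = (df[mask].replace(['é','è'], 'e', regex=True)
              .apply(lambda x: x.str.lower())
              .reset_index(drop=True))
print (df)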
                                     id raw_value manual_raw_value
0  000b08e3-4129-4fd2-8ec0-23d00fe38a45        ok               ok
1  0070ae6a-944b-4549-a229-00301cc96e29      6456            14762
2  00827aad-f737-4ec6-9881-988982662ad8        ht               ht
3  008796d7-b21e-4b91-854f-1d163e336c05    avenue           avenue
4  009dfaa8-5343-4345-8619-3010a1f77a03      1740             1740
5  00ad9cc7-c048-4d82-aa90-727d6eede4ea     total            total
6  00c46967-ee13-40ac-a4b4-4c0cf4186e90        st               st
7  01167f7e-01eb-4033-b62b-92674ba40182        la               la
8  013254c9-4353-45dc-9955-7520474803b7     zebra            zebra
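
The question also asks for the removed rows to be stored separately, and the way the two per-column checks are combined decides which rows count as removed. The following is only a sketch under assumptions: the tiny DataFrame, its short ids, and the 'H_T' value are hypothetical stand-ins for the real file, and re.escape is swapped in for the hand-escaped list above.

```python
import re
import pandas as pd

# Hypothetical miniature of the CSV; 'H_T' is invented so that only one column matches
df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'raw_value': ['3/', 'ok', 'zébra', 'HT'],
    'manual_raw_value': ['3\\', 'ok', 'zébra', 'H_T'],
})

# Character class built from the characters listed in the question
chars = ',;:\\/.$€%_-'
pattern = '[' + re.escape(chars) + ']'

has_raw = df['raw_value'].str.contains(pattern)
has_manual = df['manual_raw_value'].str.contains(pattern)

drop_either = has_raw | has_manual   # a row is removed if either column matches
drop_both = has_raw & has_manual     # a row is removed only if both columns match

# Split the frame using the "either column" rule (an assumption about the intent)
removed = df[drop_either]
kept = df[~drop_either].copy()

# Lowercase and strip the accents in the two value columns of the kept rows
cols = ['raw_value', 'manual_raw_value']
kept[cols] = kept[cols].apply(
    lambda s: s.str.lower().str.replace('[éè]', 'e', regex=True))
kept = kept.reset_index(drop=True)

# to_csv returns the CSV text when no path is given; pass a path to write a file
removed_csv = removed.to_csv(index=False)
print(kept)
print(removed_csv)
```

Keeping rows with ~has_raw | ~has_manual, as in the answer above, is the same as removing with drop_both. On the sample rows in the question both rules give the same result, because every listed row has either both columns clean or both columns matching.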

Answer 1: (score: 0)

This can be done with applymap:

cols = ['raw_value', 'manual_raw_value']
special = ['{', ',', ';', ':', '\\', '/', '.', '$', '€', '%', '_', '-', '}']

# Check every cell of the two columns for any of the special characters;
# cells that contain one are replaced by NaN.
# (applymap is named DataFrame.map from pandas 2.1 on.)
df[cols] = df[cols][~df[cols].applymap(lambda x: any(ch in special for ch in x))]

# Drop the rows that now hold NaN (those whose values had a special character)
df.dropna(axis=0, how='any', inplace=True)

# Convert every value to lower case
df = df.applymap(lambda x: x.lower())

# Replace é and è with e
df = df.applymap(lambda x: x.replace('é', 'e').replace('è', 'e'))

print(df)

Result

                                      id raw_value manual_raw_value
1   000b08e3-4129-4fd2-8ec0-23d00fe38a45        ok               ok
4   0070ae6a-944b-4549-a229-00301cc96e29      6456            14762
5   00827aad-f737-4ec6-9881-988982662ad8        ht               ht
6   008796d7-b21e-4b91-854f-1d163e336c05    avenue           avenue
7   009dfaa8-5343-4345-8619-3010a1f77a03      1740             1740
8   00ad9cc7-c048-4d82-aa90-727d6eede4ea     total            total
9   00c46967-ee13-40ac-a4b4-4c0cf4186e90        st               st
10  01167f7e-01eb-4033-b62b-92674ba40182        la               la
11  013254c9-4353-45dc-9955-7520474803b7     zebra            zebra
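
The accent step from the question (é and è become e) can also be written with a translation table instead of per-cell calls. This is a minimal sketch, not part of the answer: the Series and the 'zèbra' variant are made up for illustration, and Series.str.translate is used in place of applymap.

```python
import pandas as pd

# Hypothetical values; 'zébra' and 'Total' appear in the question, 'zèbra' is invented
s = pd.Series(['zébra', 'zèbra', 'Total'])

# Map both accented letters to a plain 'e' in one pass
table = str.maketrans('éè', 'ee')

# Lowercase first, then translate character by character
cleaned = s.str.lower().str.translate(table)
print(cleaned.tolist())
```

A table like this grows easily if more accented letters need mapping, since each one is just another character pair passed to str.maketrans.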