Removing extraneous values from a dictionary in Python

Time: 2019-03-10 09:17:13

Tags: python string dataframe

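Thank you for the solution. However, when I apply it to my data, I need the column headers to stay unaffected while the extraneous values are searched for and replaced. Here is my DataFrame. Please assist.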

import pandas as pd

df = pd.DataFrame({'Date_sampled': ['8/31/2018 0:00',
  '9/31/2018  12:00:00 AM',  '2/31/2018  12:00:00 AM',  '2/31/2018  12:00:00 AM',  '12/31/2018 0:00',
  '12/31/2018 0:00',  '12/31/2018 0:00',  '6/31/2018 12:00:00 AM',  '2/31/2018  12:00:00 AM',
  '2/31/2018  12:00:00 AM',  '12/31/2018 0:00',  '12/31/2018 0:00'], 'apple18:apple1': ['15.8',
  '27.84883300816733\\U',  '27.68303400840678\\O',  '???',  '?????',  '67.61',  '27.33',
  '37.73069872941176\\M',  '37.98761171079137\\F',  '10.2\\I',  '10.1\\Y',  '67.61'],
'Orange:ripe': ['89.59',  '44.64197389840307\\Y',  '39.93121897299962\\W',  '7.2\\K',
  '6.0\\Y',  '9.19',  '18.62',  '???',  '???',  '7.2\\T',  '7.0\\D',  '79.1'], 'Banana': ['51.36',  '?????',
  '???',  '23.77814972104277\\T',  '27.80709611086276\\N',  '13.3\\T',  '31.27',  '?????',  '???',
  '17.3\\H',  '16.4\\E',  '11.36'], 'Egg24:Eg17 (Toasted:Scrammed)': ['17.98',  '13.3\\T',  '9.4\\J',
  '2396,7',  'nan',  '14',  'None',  'None',  '14.8',  '44.64197349440307\\Y',  '39.93151497599965\\W',
  '-'], 'Bread(white)': ['23.24',  '6.1\\Q',  '7.2\\K',  'None',  'None',  '20',  'None',  'None',  '20.4',  '3473,3',
  '1606,3',  '47,7'], 'Potato:24': ['-',  '-',  '-',  '-',  'nan',  'nan',  'nan',  '343.859844\\OP',  '56.06332588\\RS',
  '75.1973942\\ZTO',  'nan',  '-']})

2 Answers:

Answer 0: (score: 0)

I believe you need Series.str.replace together with Series.str.extract to pull out the numeric values:

import pandas as pd

d ={'apple': ['15.8', '356,2', '51.36', '17986,8','6.0\\tY', 'Null'],
    'banana': ['27.84883300816733\\U', 'Z44.64197389840307\\Y', '?????', '13.3\\T', 'p17.6', '6.1\\Q'],
    'cheese': ['27.68303400840678\\O', '39.93121897299962\\W', '???', '9.4\\J', '7.2\\K','6.0\\Y'], 
    'egg': ['???', '7.2\\K', '66.0\\p','23.77814972104277\\T', '2396,7', 'None']}

df = pd.DataFrame(d)
print (df)
     apple                banana               cheese                  egg
0     15.8   27.84883300816733\U  27.68303400840678\O                  ???
1    356,2  Z44.64197389840307\Y  39.93121897299962\W                7.2\K
2    51.36                 ?????                  ???               66.0\p
3  17986,8                13.3\T                9.4\J  23.77814972104277\T
4   6.0\tY                 p17.6                7.2\K               2396,7
5     Null                 6.1\Q                6.0\Y                 None

#https://stackoverflow.com/a/28832504/2901002
pat = r"(\d+\.*\d*)"
df = df.apply(lambda x: x.str.replace(',','.').str.extract(pat, expand=False))
print (df)

     apple             banana             cheese                egg
0     15.8  27.84883300816733  27.68303400840678                NaN
1    356.2  44.64197389840307  39.93121897299962                7.2
2    51.36                NaN                NaN               66.0
3  17986.8               13.3                9.4  23.77814972104277
4      6.0               17.6                7.2             2396.7
5      NaN                6.1                6.0                NaN
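
To see what the pattern does to a single value, the same steps can be sketched with the standard `re` module. This is only an illustration: `first_number` is a hypothetical helper, not part of pandas, and it assumes that `Series.str.extract` keeps the first match in each string, which is how `re.search` behaves.

```python
import re

pat = r"(\d+\.*\d*)"

def first_number(text):
    """Hypothetical helper: swap ',' for '.', then return the first match of pat (or None)."""
    match = re.search(pat, text.replace(',', '.'))
    return match.group(1) if match else None

# Values taken from the sample frame above, plus one column name from the question.
for raw in ['27.84883300816733\\U', 'Z44.64197389840307\\Y', '356,2',
            '6.0\\tY', '?????', 'Null', 'apple18:apple1']:
    print(repr(raw), '->', first_number(raw))
```

The last value shows why the pattern should only be applied to cell values: a header such as 'apple18:apple1' would be reduced to '18'. Because `df.apply` passes only the column values to the lambda, the headers themselves are not rewritten.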

Finally, you can optionally convert the result to floats:

df = df.apply(lambda x: x.str.replace(',','.').str.extract(pat, expand=False)).astype(float)
print (df)
      apple     banana     cheese         egg
0     15.80  27.848833  27.683034         NaN
1    356.20  44.641974  39.931219     7.20000
2     51.36        NaN        NaN    66.00000
3  17986.80  13.300000   9.400000    23.77815
4      6.00  17.600000   7.200000  2396.70000
5       NaN   6.100000   6.000000         NaN
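
For the frame in the question, which also contains a Date_sampled column, one possible adaptation is to restrict the extraction to the other columns. The sketch below uses a three-row subset of the question's data and assumes that every column except Date_sampled should become numeric; `value_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Three-row subset of the question's frame; the values are copied from it.
df = pd.DataFrame({
    'Date_sampled': ['8/31/2018 0:00', '12/31/2018 0:00', '12/31/2018 0:00'],
    'apple18:apple1': ['15.8', '27.84883300816733\\U', '???'],
    'Bread(white)': ['23.24', '3473,3', 'None'],
})

pat = r"(\d+\.*\d*)"

# Assumption: only the date column has to be left alone.
value_cols = [c for c in df.columns if c != 'Date_sampled']

# Clean the cell values of the selected columns; the column names are not touched.
df[value_cols] = df[value_cols].apply(
    lambda col: col.str.replace(',', '.', regex=False).str.extract(pat, expand=False)
).astype(float)

print(df)
```

The headers keep their original spelling and order, and the date strings are passed through unchanged.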

Answer 1: (score: 0)

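First, you had better make sure these "extraneous" characters are not a symptom of a bigger problem: whatever generated this data is producing garbage!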

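import re

# Remove every character that is not a digit or a dot from each cell.
# Only the cell values are rewritten; the column names stay as they are.
for k in df.keys():
    df[k] = [re.sub(r'[^0-9.]', '', str(value)) for value in df[k]]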
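
A quick check of how this character filter behaves on a few values from the question may be useful before applying it to the whole frame. The second variant, which replaces ',' with '.' first, is an assumption about the intended meaning of values such as '2396,7' and is not part of the answer above.

```python
import re

# Sample values copied from the question's frame.
samples = ['27.84883300816733\\U', '2396,7', '???', 'None', '-']

# Same filter as in the loop above: keep only digits and dots.
stripped = [re.sub(r'[^0-9.]', '', s) for s in samples]
print(stripped)

# Possible adjustment (assumption): treat ',' as a decimal separator before filtering.
adjusted = [re.sub(r'[^0-9.]', '', s.replace(',', '.')) for s in samples]
print(adjusted)
```

With the plain filter the comma is simply dropped, so '2396,7' becomes '23967', and placeholder values such as '???' or 'None' become empty strings that still need handling before a float conversion.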