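Thank you for the solution. However, when I apply it to my data, I need the column headers to stay unaffected while the extraneous values are searched for and replaced. Here is my DataFrame. Please assist.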
import pandas as pd
df = pd.DataFrame({'Date_sampled': ['8/31/2018 0:00',
'9/31/2018 12:00:00 AM', '2/31/2018 12:00:00 AM', '2/31/2018 12:00:00 AM', '12/31/2018 0:00',
'12/31/2018 0:00', '12/31/2018 0:00', '6/31/2018 12:00:00 AM', '2/31/2018 12:00:00 AM',
'2/31/2018 12:00:00 AM', '12/31/2018 0:00', '12/31/2018 0:00'], 'apple18:apple1': ['15.8',
'27.84883300816733\\U', '27.68303400840678\\O', '???', '?????', '67.61', '27.33',
'37.73069872941176\\M', '37.98761171079137\\F', '10.2\\I', '10.1\\Y', '67.61'],
'Orange:ripe': ['89.59', '44.64197389840307\\Y', '39.93121897299962\\W', '7.2\\K',
'6.0\\Y', '9.19', '18.62', '???', '???', '7.2\\T', '7.0\\D', '79.1'], 'Banana': ['51.36', '?????',
'???', '23.77814972104277\\T', '27.80709611086276\\N', '13.3\\T', '31.27', '?????', '???',
'17.3\\H', '16.4\\E', '11.36'], 'Egg24:Eg17 (Toasted:Scrammed)': ['17.98', '13.3\\T', '9.4\\J',
'2396,7', 'nan', '14', 'None', 'None', '14.8', '44.64197349440307\\Y', '39.93151497599965\\W',
'-'], 'Bread(white)': ['23.24', '6.1\\Q', '7.2\\K', 'None', 'None', '20', 'None', 'None', '20.4', '3473,3',
'1606,3', '47,7'], 'Potato:24': ['-', '-', '-', '-', 'nan', 'nan', 'nan', '343.859844\\OP', '56.06332588\\RS',
'75.1973942\\ZTO', 'nan', '-']})
Answer 0 (score: 0)
I believe you need Series.str.replace together with Series.str.extract to pull out the numeric values:
import pandas as pd

d = {'apple': ['15.8', '356,2', '51.36', '17986,8', '6.0\\tY', 'Null'],
     'banana': ['27.84883300816733\\U', 'Z44.64197389840307\\Y', '?????', '13.3\\T', 'p17.6', '6.1\\Q'],
     'cheese': ['27.68303400840678\\O', '39.93121897299962\\W', '???', '9.4\\J', '7.2\\K', '6.0\\Y'],
     'egg': ['???', '7.2\\K', '66.0\\p', '23.77814972104277\\T', '2396,7', 'None']}
df = pd.DataFrame(d)
print(df)
     apple                banana               cheese                  egg
0     15.8   27.84883300816733\U  27.68303400840678\O                  ???
1    356,2  Z44.64197389840307\Y  39.93121897299962\W                7.2\K
2    51.36                 ?????                  ???               66.0\p
3  17986,8                13.3\T                9.4\J  23.77814972104277\T
4   6.0\tY                 p17.6                7.2\K               2396,7
5     Null                 6.1\Q                6.0\Y                 None
#https://stackoverflow.com/a/28832504/2901002
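# One or more digits, optionally followed by dots and further digits
pat = r"(\d+\.*\d*)"
# Turn decimal commas into dots, then keep only the first numeric match in each cell
df = df.apply(lambda x: x.str.replace(',','.').str.extract(pat, expand=False))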
print(df)
     apple             banana             cheese                egg
0     15.8  27.84883300816733  27.68303400840678                NaN
1    356.2  44.64197389840307  39.93121897299962                7.2
2    51.36                NaN                NaN               66.0
3  17986.8               13.3                9.4  23.77814972104277
4      6.0               17.6                7.2             2396.7
5      NaN                6.1                6.0                NaN
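To see what the pattern does on a single cell, here is a small sketch that uses Python's re module directly on a few values from the frame above. The helper name first_number is made up for illustration; it mimics the replace-then-extract chain and is not how pandas implements it internally.

```python
import re

pat = r"(\d+\.*\d*)"

def first_number(cell):
    """Hypothetical helper: mimic str.replace(',', '.') followed by str.extract(pat)."""
    match = re.search(pat, cell.replace(',', '.'))
    # str.extract gives NaN when nothing matches; None plays that role here
    return match.group(1) if match else None

samples = ['356,2', 'Z44.64197389840307\\Y', '6.0\\tY', '?????', 'Null']
results = [first_number(s) for s in samples]
print(results)
```

Leading letters, trailing backslash codes and cells without any digit are all handled by the same search, because only the first run of digits (with an optional decimal part) is kept.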
Finally, you may want to convert the result to floats:
# Same chain as above, with a cast to float at the end
df = df.apply(lambda x: x.str.replace(',','.').str.extract(pat, expand=False)).astype(float)
print(df)
      apple     banana     cheese         egg
0     15.80  27.848833  27.683034         NaN
1    356.20  44.641974  39.931219     7.20000
2     51.36        NaN        NaN    66.00000
3  17986.80  13.300000   9.400000    23.77815
4      6.00  17.600000   7.200000  2396.70000
5       NaN   6.100000   6.000000         NaN
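For the DataFrame in the question, the same idea can be limited to the measurement columns so that Date_sampled and the column headers stay as they are. apply works on cell values only and does not rewrite column labels, even when a label itself contains digits. This is a minimal sketch, assuming rows 0, 1 and 3 of three of the question's columns are a fair sample and that Date_sampled is the only column to skip; value_cols and original_columns are names introduced here.

```python
import pandas as pd

# Rows 0, 1 and 3 of three columns from the question (shortened sample, an assumption)
df = pd.DataFrame({
    'Date_sampled': ['8/31/2018 0:00', '9/31/2018 12:00:00 AM', '2/31/2018 12:00:00 AM'],
    'apple18:apple1': ['15.8', '27.84883300816733\\U', '???'],
    'Egg24:Eg17 (Toasted:Scrammed)': ['17.98', '13.3\\T', '2396,7'],
})

pat = r"(\d+\.*\d*)"
original_columns = list(df.columns)

# Assumption: every column except the date column holds measurements
value_cols = df.columns.drop('Date_sampled')

# Clean only the measurement columns; the date column is left alone
df[value_cols] = df[value_cols].apply(
    lambda s: s.str.replace(',', '.', regex=False).str.extract(pat, expand=False)
).astype(float)

print(df)
print(list(df.columns) == original_columns)
```

Selecting the columns up front keeps the date strings out of the regular expression, which would otherwise reduce a value like '8/31/2018 0:00' to its first number.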
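Answer 1 (score: 0)
First, you had better make sure these "extraneous" characters are not a symptom of a bigger problem: whatever is generating this data is producing garbage!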
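import re

# Strip every character that is not a digit or a dot, column by column
for k in df.keys():
    # `value` avoids shadowing the built-in str; str(value) also copes with non-string cells
    df[k] = [re.sub('[^0-9.]', '', str(value)) for value in df[k]]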
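One caveat with stripping characters this way: the comma in a value such as '2396,7' is removed too, which leaves '23967', and cells like 'None' or '-' end up as empty strings. A hedged variant, assuming the comma is a decimal separator as in the other answer, swaps it for a dot first and lets pd.to_numeric turn anything unparseable into NaN. The sample values come from the question; clean_cell and cleaned are names introduced here.

```python
import re
import pandas as pd

# A few cells from one of the question's columns (shortened sample, an assumption)
df = pd.DataFrame({'Egg24:Eg17 (Toasted:Scrammed)': ['17.98', '13.3\\T', '2396,7', 'None', '-']})

def clean_cell(value):
    """Hypothetical helper: decimal comma -> dot, then drop everything but digits and dots."""
    return re.sub(r'[^0-9.]', '', str(value).replace(',', '.'))

cleaned = df.copy()
for k in cleaned.keys():
    # errors='coerce' maps empty or malformed strings to NaN instead of raising
    cleaned[k] = pd.to_numeric(cleaned[k].map(clean_cell), errors='coerce')

print(cleaned)
```

Because the loop assigns to existing keys, the column labels are never passed through the regular expression and stay unchanged.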