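I am working with some real-time data about people, and the Age column of the DataFrame is really messy. The expected output is an age_bins column with bin edges [0,10,20,30,40,50,60,70,80,90,100].
What is the best way to clean up this kind of messy data?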
import pandas as pd
df = pd.DataFrame({'Age':['23', '64', '71', '53', '40', '45', '30-39', '50-59', '60-69',
'30', '65', '44', '8-68', '21-72', '26', '36', '43', '70', '52',
'66', '27', '17', '51', '68', '35', '28', '58', '33', '31', '50',
'24', '88', '29', '21', '78', '60', '63', '37', '32', '49',
'20-29', '47', '18-99', '41', '39', '42', '38', '7', '40-49', '82',
'61', '34-66', '62', '40-89', '80-89', '55', '0.25', '13-19', '69',
'16', '8', '10', '25', '34', '55-74', '75-', '70-79', '79',
'35-54', '55-', '95', '54', '40-50', '46', '48', '57', '56']})
Answer 0 (score: 1)
You can remove any trailing '-' with Series.str.strip, split the values on '-' into 2 columns with Series.str.split, and then apply pd.cut to each column:
df1 = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j-1) for i, j in zip(bins[:-1], bins[1:])]
g1 = pd.cut(df1[0], bins=bins, right=False, labels=labels)
g2 = pd.cut(df1[1], bins=bins, right=False, labels=labels)
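To make the intermediate values easier to see, here is a minimal sketch of the same strip/split/cut steps on a small sample. The names sample, parts, low_bin and high_bin are made up for this illustration and are not part of the answer's code.

```python
import pandas as pd

# A few representative values: a single age, a range, a trailing '-',
# a range spanning several bins, and a fractional age.
sample = pd.Series(['23', '30-39', '75-', '8-68', '0.25'])

# Strip a trailing '-' first, then split on '-' into two columns.
# Rows without a second part get a missing value in column 1.
parts = sample.str.strip('-').str.split('-', expand=True).astype(float)

bins = list(range(0, 101, 10))
labels = ['{}-{}'.format(lo, hi - 1) for lo, hi in zip(bins[:-1], bins[1:])]

# right=False makes each bin closed on the left: [20, 30) is labelled '20-29'.
low_bin = pd.cut(parts[0], bins=bins, right=False, labels=labels)
high_bin = pd.cut(parts[1], bins=bins, right=False, labels=labels)

print(pd.DataFrame({'raw': sample, 'low_bin': low_bin, 'high_bin': high_bin}))
```

Single ages leave high_bin missing, which is why the next step fills those gaps from the first column before comparing.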
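Then compare the two results, after filling the missing values in g2 with the values from g1 so that single ages always match. Where the two bins differ, Series.mask sets the value to missing when creating the new column: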
df['age_bins'] = g1.mask(g1.ne(g2.fillna(g1)))
print (df)
Age age_bins
0 23 20-29
1 64 60-69
2 71 70-79
3 53 50-59
4 40 40-49
.. ... ...
72 40-50 NaN
73 46 40-49
74 48 40-49
75 57 50-59
76 56 50-59
[77 rows x 2 columns]
Unmatched values:
df1 = df[df['age_bins'].isna()]
print (df1)
Age age_bins
12 8-68 NaN
13 21-72 NaN
42 18-99 NaN
51 34-66 NaN
53 40-89 NaN
64 55-74 NaN
68 35-54 NaN
72 40-50 NaN
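If the same cleanup is needed in more than one place, the steps above could be wrapped in a small function. This is only a sketch under the same assumptions as the answer (string ages, '-' as the range separator); the name bin_ages and the reindex guard are additions for illustration, not part of the original answer.

```python
import pandas as pd

def bin_ages(ages, bins=None):
    """Hypothetical helper: map messy age strings to decade bins.

    Values whose two ends fall in different bins become missing.
    """
    if bins is None:
        bins = list(range(0, 101, 10))
    labels = ['{}-{}'.format(lo, hi - 1) for lo, hi in zip(bins[:-1], bins[1:])]

    parts = ages.str.strip('-').str.split('-', expand=True).astype(float)
    # Guarantee a second column even when no value contains '-'.
    parts = parts.reindex(columns=[0, 1])

    low = pd.cut(parts[0], bins=bins, right=False, labels=labels)
    high = pd.cut(parts[1], bins=bins, right=False, labels=labels)
    # Single ages have no upper end, so fall back to the lower bin there.
    return low.mask(low.ne(high.fillna(low)))

ages = pd.Series(['23', '30-39', '75-', '8-68', '40-50', '0.25'])
result = bin_ages(ages)
print(result)
```

With the question's data, the call would be df['age_bins'] = bin_ages(df['Age']). Rows such as '8-68' and '40-50' stay missing, so they still need a separate decision on how to handle them.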