Cleaning messy data with pandas

Date: 2020-03-25 10:55:59

Tags: python pandas

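I'm working with some real-time data about people, and the Age column of the DataFrame is really messy. The desired output is an age_bins column with bins [0,10,20,30,40,50,60,70,80,90,100].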

What is the best way to clean up this kind of messy data?

import pandas as pd

df = pd.DataFrame({'Age':['23', '64', '71', '53', '40', '45', '30-39', '50-59', '60-69',
       '30', '65', '44', '8-68', '21-72', '26', '36', '43', '70', '52',
       '66', '27', '17', '51', '68', '35', '28', '58', '33', '31', '50',
       '24', '88', '29', '21', '78', '60', '63', '37', '32', '49',
       '20-29', '47', '18-99', '41', '39', '42', '38', '7', '40-49', '82',
       '61', '34-66', '62', '40-89', '80-89', '55', '0.25', '13-19', '69',
       '16', '8', '10', '25', '34', '55-74', '75-', '70-79', '79',
       '35-54', '55-', '95', '54', '40-50', '46', '48', '57', '56']})

1 answer:

Answer 0 (score: 1)

You can remove a possible trailing - with Series.str.strip, split the values into 2 columns with Series.str.split, and then apply cut to each column:

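# Strip a dangling '-' (e.g. '75-'), then split ranges into lower/upper columns
df1 = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)
bins = [0,10,20,30,40,50,60,70,80,90,100]

# Labels such as '0-9', '10-19', ..., '90-99'
labels = ['{}-{}'.format(i, j-1) for i, j in zip(bins[:-1], bins[1:])]

# right=False makes each bin left-closed, e.g. [20, 30)
g1 = pd.cut(df1[0], bins=bins, right=False, labels=labels)
g2 = pd.cut(df1[1], bins=bins, right=False, labels=labels)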
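
To see what the strip/split/cut steps produce, here is a minimal sketch on a handful of values taken from the question. The names sample, parts, low and high are illustrative only and are not part of the answer above.

```python
import pandas as pd

# A few representative values from the question's Age column.
sample = pd.Series(['23', '30-39', '75-', '8-68', '0.25'], name='Age')

# Strip a dangling '-' (as in '75-'), then split ranges into two float columns.
# Single ages get NaN in the second column.
parts = sample.str.strip('-').str.split('-', expand=True).astype(float)

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['{}-{}'.format(i, j - 1) for i, j in zip(bins[:-1], bins[1:])]

# Bin the lower and upper bounds separately; bins are left-closed.
low = pd.cut(parts[0], bins=bins, right=False, labels=labels)
high = pd.cut(parts[1], bins=bins, right=False, labels=labels)

print(parts)
print(low.tolist())
print(high.tolist())
```

Only '30-39' and '8-68' have an upper bound here, so high is missing everywhere else; for '8-68' the two bounds land in different bins.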

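Then compare the two results. Where the bins match (missing values in g2 are first filled from g1) the bin is kept; otherwise Series.mask sets it to a missing value when the new column is created: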

# Keep g1 where both bounds fall in the same bin, otherwise NaN
df['age_bins'] = g1.mask(g1.ne(g2.fillna(g1)))
print(df)
      Age age_bins
0      23    20-29
1      64    60-69
2      71    70-79
3      53    50-59
4      40    40-49
..    ...      ...
72  40-50      NaN
73     46    40-49
74     48    40-49
75     57    50-59
76     56    50-59

[77 rows x 2 columns]
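
The comparison step can be checked in isolation. The sketch below uses three made-up lower/upper pairs and a shortened bin list (both assumptions for illustration): a single age, a range inside one bin, and a range that spans two bins.

```python
import pandas as pd

# Shortened, hypothetical bins and labels for the illustration.
bins = [0, 10, 20, 30]
labels = ['0-9', '10-19', '20-29']

# Lower and upper bounds: single age 5, range 13-19, range 8-25.
lower = pd.Series([5.0, 13.0, 8.0])
upper = pd.Series([float('nan'), 19.0, 25.0])

g1 = pd.cut(lower, bins=bins, right=False, labels=labels)
g2 = pd.cut(upper, bins=bins, right=False, labels=labels)

# Where there is no upper bound, reuse the lower bin so the row counts as a match.
upper_or_lower = g2.fillna(g1)

# Blank out rows whose two bounds fall into different bins.
result = g1.mask(g1.ne(upper_or_lower))
print(result.tolist())
```

The first two rows should keep their bin, while the 8-25 row becomes missing because its bounds fall in '0-9' and '20-29'.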

Values that do not match:

df1 = df[df['age_bins'].isna()]
print(df1)
      Age age_bins
12   8-68      NaN
13  21-72      NaN
42  18-99      NaN
51  34-66      NaN
53  40-89      NaN
64  55-74      NaN
68  35-54      NaN
72  40-50      NaN