我有一列似乎具有4种不同格式的数据,我创建了一个带有数组的小片段来说明我正在使用的内容
ex_array = np.array(['100X172',
'78X120',
'1 ac',
'76,666',
'85X175',
'19,928',
'14810',
'3 ac',
'90X181',
'38X150',
'19040',
'8265',
'100X125',
'6000',
'8,750',
'.448 ac'])
ex_df = pd.DataFrame(data=ex_array, columns=['ex_col'])
这将按预期输出以下内容:
ex_col
0 100X172
1 78X120
2 1 ac
3 76,666
4 85X175
5 19,928
6 14810
7 3 ac
8 90X181
9 38X150
10 19040
11 8265
12 100X125
13 6000
14 8,750
15 .448 ac
目标是使所有内容均以英亩为单位的列标准化,并且所需的输出如下所示
ex_df['acreage'] =
acreage
0 .394858
1 .214876
2 1
3 1.76
4 .341483
5 .457484
6 .339991
7 3
8 .373967
9 .130854
10 .437098
11 .189738
12 .284665
13 .137741
14 .200872
15 .448
我在熊猫中的想法是创建3个布尔列以处理不同类型的数据
ex_df['hasX'] = ex_df['ex_col'].str.contains('X')
ex_df['has_ac'] = ex_df['ex_col'].str.contains('ac')
ex_df['has_comma'] = ex_df['ex_col'].str.contains(',')
这将按预期输出
ex_df
ex_col hasX has_ac has_comma
0 100X172 True False False
1 78X120 True False False
2 1 ac False True False
3 76,666 False False True
4 85X175 True False False
5 19,928 False False True
6 14810 False False False
7 3 ac False True False
8 90X181 True False False
9 38X150 True False False
10 19040 False False False
11 8265 False False False
12 100X125 True False False
13 6000 False False False
14 8,750 False False True
15 .448 ac False True False
接下来,我尝试了如下多次定位操作
ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.split()[0]))
ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
这将输出以下错误:
<ipython-input-40-e9eb1eacbacb> in <module>
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-40-e9eb1eacbacb> in <lambda>(x)
----> 1 ex_df.loc[(ex_df['hasX']==True), 'acreage']= ex_df['ex_col'].apply(lambda x: float(((int(x.split('X')[0]))*(int(x.split('X')[-1])))/43560))
2 ex_df.loc[(ex_df['has_ac']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: x.split()[0])
3 ex_df.loc[(ex_df['has_comma']==True), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x.replace(',','')))
4 ex_df.loc[((ex_df['hasX']==False) & (ex_df['has_ac']==False) & (ex_df['has_comma']==False)), 'acreage'] = ex_df['ex_col'].apply(lambda x: float(x))
ValueError: invalid literal for int() with base 10: '1 ac'
答案 0 :(得分:1)
让我们尝试extract
拆分数据列并使用np.select
进行映射:
data = (ex_df['ex_col'].str.replace(',','')
.str.extract('([\.\d]+)\s?(ac|X)?([\.\d,]+)?')
)
data[[0,2]] = data[[0,2]].astype(float)
ex_df['area'] = np.select((data[1].eq('X'), data[1].eq('ac')),
(data[0]* data[2]/43560,data[0]),
data[0]/43560 )
输出:
ex_col area
0 100X172 0.394858
1 78X120 0.214876
2 1 ac 1.000000
3 76,666 1.760009
4 85X175 0.341483
5 19,928 0.457484
6 14810 0.339991
7 3 ac 3.000000
8 90X181 0.373967
9 38X150 0.130854
10 19040 0.437098
11 8265 0.189738
12 100X125 0.286961
13 6000 0.137741
14 8,750 0.200872
15 .448 ac 0.448000