Extract an integer from a string value in a pandas DataFrame cell

Time: 2016-05-24 12:52:39

Tags: python-2.7 pandas ipython

I have a DataFrame train_df with a column named AgeuponOutcome that contains values like the following:

AgeuponOutcome
1 year
1 year
2 years
3 weeks
2 years
1 month
3 weeks
3 weeks
5 months
1 year
2 years
2 years
4 years

Below is my full dataset.

OutcomeType      AnimalType  SexuponOutcome  AgeuponOutcome  Breed                              Color
Return_to_owner  Dog         Neutered Male   1 year          Shetland Sheepdog Mix              Brown/White
Euthanasia       Cat         Spayed Female   1 year          Domestic Shorthair Mix             Cream Tabby
Adoption         Dog         Neutered Male   2 years         Pit Bull Mix                       Blue/White
Transfer         Cat         Intact Male     3 weeks         Domestic Shorthair Mix             Blue Cream
Transfer         Dog         Neutered Male   2 years         Lhasa Apso/Miniature Poodle        Tan
Transfer         Dog         Intact Female   1 month         Cairn Terrier/Chihuahua Shorthair  Black/Tan
Transfer         Cat         Intact Male     3 weeks         Domestic Shorthair Mix             Blue Tabby
Transfer         Cat         Unknown         3 weeks         Domestic Shorthair Mix             Brown Tabby
Adoption         Dog         Spayed Female   5 months        American Pit Bull Terrier Mix      Red/White
Adoption         Dog         Spayed Female   1 year          Cairn Terrier                      White
Transfer         Cat         Unknown         2 years         Domestic Shorthair Mix             Black
Adoption         Dog         Spayed Female   2 years         Miniature Schnauzer Mix            Silver
Adoption         Dog         Neutered Male   4 years         Pit Bull Mix                       Brown

Here is the code I use to extract the integer from the string values in the AgeuponOutcome column.

import re

word = 'month'
l = len(train_df)
for i in range(l):
    # check whether the cell mentions the unit, then pull out the digits
    if word in train_df.loc[i, 'AgeuponOutcome']:
        print re.findall("\d+", train_df.loc[i, 'AgeuponOutcome'])

But it raises the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-60-8a6df57d3cb9> in <module>()
      1 l = len(train_df)
      2 for i in range(l):
----> 3     if word in train_df.loc[i, 'AgeuponOutcome']:
      4         print re.findall("\d+", train_df.loc[i, 'AgeuponOutcome'])

TypeError: argument of type 'int' is not iterable

Could you tell me how to fix the error and extract the values? For example, I need to extract 1 from '1 year' and print 1.

1 Answer:

Answer 0: (score: 2)

You can use str.extract together with str.contains, loc and boolean indexing:

df1 = (df.AgeuponOutcome.str.extract(r'(\d+) (\w+)', expand=True))
df1.columns = ['a','b']
print (df1)
    a       b
0   1    year
1   1    year
2   2   years
3   3   weeks
4   2   years
5   1   month
6   3   weeks
7   3   weeks
8   5  months
9   1    year
10  2   years
11  2   years
12  4   years

print (df1.loc[df1.b.str.contains('month'), 'a'])
5    1
8    5
Name: a, dtype: object

print (df1.loc[df1.b.str.contains('year'), 'a'])
0     1
1     1
2     2
4     2
9     1
10    2
11    2
12    4
Name: a, dtype: object
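
The steps above can be put together as a small self-contained sketch. The sample data here is hypothetical, and the bare integer 5 is an assumption standing in for whatever non-string cell triggers the TypeError in the question. The astype(str) call and the na=False argument are additions for robustness, not part of the code shown above.

```python
import pandas as pd

# Hypothetical sample; the bare integer 5 stands in for a non-string cell
df = pd.DataFrame(
    {"AgeuponOutcome": ["1 year", "3 weeks", 5, "5 months", "2 years"]}
)

# Cast every cell to str so cells that do not look like "<number> <unit>"
# simply produce NaN instead of raising
parts = df["AgeuponOutcome"].astype(str).str.extract(r"(\d+) (\w+)", expand=True)
parts.columns = ["a", "b"]

# na=False treats rows without a unit as False, keeping the mask purely boolean
is_month = parts["b"].str.contains("month", na=False)
months = parts.loc[is_month, "a"]
print(months)
```

Note that the extracted numbers are still strings at this point; convert them with astype(int) or pd.to_numeric if you need arithmetic.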

If you need the output as new columns:

df1['month'] = (df1.loc[df1.b.str.contains('month'), 'a'])
df1['year'] = (df1.loc[df1.b.str.contains('year'), 'a'])
df1['week'] = (df1.loc[df1.b.str.contains('week'), 'a'])
print (df1)
    a       b month year week
0   1    year   NaN    1  NaN
1   1    year   NaN    1  NaN
2   2   years   NaN    2  NaN
3   3   weeks   NaN  NaN    3
4   2   years   NaN    2  NaN
5   1   month     1  NaN  NaN
6   3   weeks   NaN  NaN    3
7   3   weeks   NaN  NaN    3
8   5  months     5  NaN  NaN
9   1    year   NaN    1  NaN
10  2   years   NaN    2  NaN
11  2   years   NaN    2  NaN
12  4   years   NaN    4  NaN
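
As a possible variation, the three near-identical assignments can be written as a loop with Series.where, which keeps a value where the condition holds and fills NaN elsewhere. This is only a sketch on a hypothetical four-row frame shaped like df1; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical frame shaped like df1 above (numbers are still strings)
df1 = pd.DataFrame(
    {"a": ["1", "3", "5", "2"], "b": ["year", "weeks", "months", "years"]}
)

# One new column per unit: keep "a" where "b" mentions the unit, NaN otherwise
for unit in ["month", "year", "week"]:
    df1[unit] = df1["a"].where(df1["b"].str.contains(unit))

print(df1)
```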

EDIT by comment:

To express every age in years, you can use:

# convert to a numeric (float) dtype so the divided values fit in the column
df1['a'] = df1.a.astype(float)

# divide column a by a constant to turn months and weeks into years
df1.loc[df1.b.str.contains('month'), 'a'] = df1.loc[df1.b.str.contains('month'), 'a'] / 12
df1.loc[df1.b.str.contains('week'), 'a'] = df1.loc[df1.b.str.contains('week'), 'a'] / 52.1429
print (df1)
           a       b
0   1.000000    year
1   1.000000    year
2   2.000000   years
3   0.057534   weeks
4   2.000000   years
5   0.083333   month
6   0.057534   weeks
7   0.057534   weeks
8   0.416667  months
9   1.000000    year
10  2.000000   years
11  2.000000   years
12  4.000000   years
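
The same conversion can also be sketched with a lookup table instead of one loc assignment per unit. The dictionary name YEARS_PER_UNIT and the column name age_years are made up for illustration, the sample frame is hypothetical, and the 52.1429 weeks-per-year constant is the one used above.

```python
import pandas as pd

# Hypothetical frame shaped like df1 above (numbers are still strings)
df1 = pd.DataFrame(
    {"a": ["1", "3", "5", "2"], "b": ["year", "weeks", "months", "years"]}
)

# Years represented by one of each unit (illustrative name, same constants as above)
YEARS_PER_UNIT = {"year": 1.0, "month": 1.0 / 12, "week": 1.0 / 52.1429}

# Drop the plural "s" so "years" and "year" map to the same key
unit = df1["b"].str.rstrip("s")

# Multiply the count by the per-unit factor; unknown units would become NaN
df1["age_years"] = df1["a"].astype(int) * unit.map(YEARS_PER_UNIT)
print(df1)
```

Writing the result to a separate column leaves the original count in a untouched, which makes it easier to check the conversion row by row.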