我有一个数据框train_df,它有一个名为AgeuponOutcome的列,其中包含如下值 AgeuponOutcome
1 year
1 year
2 years
3 weeks
2 years
1 month
3 weeks
3 weeks
5 months
1 year
2 years
2 years
4 years
以下是我的完整数据集。
OutcomeType AnimalType SexuponOutcome AgeuponOutcome Breed Color
Return_to_owner Dog Neutered Male 1 year Shetland Sheepdog Mix Brown/White
Euthanasia Cat Spayed Female 1 year Domestic Shorthair Mix Cream Tabby
Adoption Dog Neutered Male 2 years Pit Bull Mix Blue/White
Transfer Cat Intact Male 3 weeks Domestic Shorthair Mix Blue Cream
Transfer Dog Neutered Male 2 years Lhasa Apso/Miniature Poodle Tan
Transfer Dog Intact Female 1 month Cairn Terrier/Chihuahua Shorthair Black/Tan
Transfer Cat Intact Male 3 weeks Domestic Shorthair Mix Blue Tabby
Transfer Cat Unknown 3 weeks Domestic Shorthair Mix Brown Tabby
Adoption Dog Spayed Female 5 months American Pit Bull Terrier Mix Red/White
Adoption Dog Spayed Female 1 year Cairn Terrier White
Transfer Cat Unknown 2 years Domestic Shorthair Mix Black
Adoption Dog Spayed Female 2 years Miniature Schnauzer Mix Silver
Adoption Dog Neutered Male 4 years Pit Bull Mix Brown
下面给出了我用来从AgeuponOutcome列中的字符串值中提取整数的代码。
word='month'
l = len(train_df)
for i in range(l):
if word in train_df.loc[i, 'AgeuponOutcome']:
print re.findall("\d+", train_df.loc[i, 'AgeuponOutcome'])
TypeError Traceback (most recent call last)
<ipython-input-60-8a6df57d3cb9> in <module>()
1 l = len(train_df)
2 for i in range(l):
----> 3 if word in train_df.loc[i, 'AgeuponOutcome']:
4 print re.findall("\d+", train_df.loc[i, 'AgeuponOutcome'])
TypeError: argument of type 'int' is not iterable
您能告诉我如何修复错误并提取值。 例如,我需要从1年后提取1个&#39;并打印1
答案 0 :(得分:2)
您可以将str.extract
与contains
和loc
与boolean indexing
一起使用:
df1 = (df.AgeuponOutcome.str.extract('(\d+) (\w+)', expand=True))
df1.columns = ['a','b']
print (df1)
a b
0 1 year
1 1 year
2 2 years
3 3 weeks
4 2 years
5 1 month
6 3 weeks
7 3 weeks
8 5 months
9 1 year
10 2 years
11 2 years
12 4 years
print (df1.loc[df1.b.str.contains('month'), 'a'])
5 1
8 5
Name: a, dtype: object
print (df1.loc[df1.b.str.contains('year'), 'a'])
0 1
1 1
2 2
4 2
9 1
10 2
11 2
12 4
Name: a, dtype: object
如果您需要输出为新列:
df1['month'] = (df1.loc[df1.b.str.contains('month'), 'a'])
df1['year'] = (df1.loc[df1.b.str.contains('year'), 'a'])
df1['week'] = (df1.loc[df1.b.str.contains('week'), 'a'])
print (df1)
a b month year week
0 1 year NaN 1 NaN
1 1 year NaN 1 NaN
2 2 years NaN 2 NaN
3 3 weeks NaN NaN 3
4 2 years NaN 2 NaN
5 1 month 1 NaN NaN
6 3 weeks NaN NaN 3
7 3 weeks NaN NaN 3
8 5 months 5 NaN NaN
9 1 year NaN 1 NaN
10 2 years NaN 2 NaN
11 2 years NaN 2 NaN
12 4 years NaN 4 NaN
通过评论编辑:
您可以使用:
#convert to int
df1['a'] = df1.a.astype(int)
#divide by constant to column a
df1.loc[df1.b.str.contains('month'), 'a'] = df1.loc[df1.b.str.contains('month'), 'a'] / 12
df1.loc[df1.b.str.contains('week'), 'a'] = df1.loc[df1.b.str.contains('week'), 'a'] /52.1429
print (df1)
a b
0 1.000000 year
1 1.000000 year
2 2.000000 years
3 0.057534 weeks
4 2.000000 years
5 0.083333 month
6 0.057534 weeks
7 0.057534 weeks
8 0.416667 months
9 1.000000 year
10 2.000000 years
11 2.000000 years
12 4.000000 years