I have a DataFrame that contains a Boroughs_visited column (among many other columns):
Index User Boroughs_visited
0 Eminem Manhattan, Bronx
1 BrSpears NaN
2 Elvis Brooklyn
3 Adele Queens, Brooklyn
I want to create a third column showing which users visited Brooklyn, so I wrote the slowest possible code in Python:
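import re
import pandas as pd

df['Brooklyn'] = 0

def borough():
    # Check each row's borough string for the whole word "Brooklyn"
    for index, x in enumerate(df['Boroughs_visited']):
        if pd.isnull(x):
            continue  # skip missing values
        elif re.search(r'\bBrooklyn\b', x):
            # index is positional, so this relies on the default 0..n-1 index
            df.loc[index, 'Brooklyn'] = 1

borough()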
This results in:
Index User Boroughs_visited Brooklyn
0 Eminem Manhattan, Bronx 0
1 BrSpears NaN 0
2 Elvis Brooklyn 1
3 Adele Queens, Brooklyn 1
My computer takes 15 seconds to run this on 2000 rows. Is there a faster way?
Answer 0: (score: 2)
Use the .str accessor with contains, together with fillna:
df['Brooklyn'] = (df.Boroughs_visited.str.contains('Brooklyn') * 1).fillna(0)
Or another way of writing the same statement:
df['Brooklyn'] = df.Boroughs_visited.str.contains('Brooklyn').mul(1, fill_value=0)
Output:
Index User Boroughs_visited Brooklyn
0 0 Eminem Manhattan, Bronx 0
1 1 BrSpears NaN None 0
2 2 Elvis Brooklyn 1
3 3 Adele Queens, Brooklyn 1
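As a possible variation on the approach above, here is a minimal sketch that keeps the word-boundary pattern from the question and lets contains treat the missing values itself through its na parameter, so no fillna is needed. The sample frame is rebuilt from the table in the question; the rest of the real data is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (the real frame has more columns)
df = pd.DataFrame({
    "User": ["Eminem", "BrSpears", "Elvis", "Adele"],
    "Boroughs_visited": ["Manhattan, Bronx", np.nan, "Brooklyn", "Queens, Brooklyn"],
})

# na=False makes missing entries count as "no match", giving a boolean Series;
# astype(int) then turns True/False into 1/0
df["Brooklyn"] = (
    df["Boroughs_visited"]
    .str.contains(r"\bBrooklyn\b", na=False)
    .astype(int)
)

print(df)
```

This should leave the Brooklyn column with an integer dtype rather than object, which may matter if the column is summed or grouped later.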
Answer 1: (score: 1)
You can get all the boroughs for the price of one:
df.join(df.Boroughs_visited.str.get_dummies(sep=', '))
Index User Boroughs_visited Bronx Brooklyn Manhattan Queens
0 0 Eminem Manhattan, Bronx 1 0 1 0
1 1 BrSpears NaN 0 0 0 0
2 2 Elvis Brooklyn 0 1 0 0
3 3 Adele Queens, Brooklyn 0 1 0 1
But if you really, really only want Brooklyn:
df.join(df.Boroughs_visited.str.get_dummies(sep=', ').Brooklyn)
Index User Boroughs_visited Brooklyn
0 0 Eminem Manhattan, Bronx 0
1 1 BrSpears NaN 0
2 2 Elvis Brooklyn 1
3 3 Adele Queens, Brooklyn 1
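One caveat with selecting .Brooklyn from the dummies: if no row mentions Brooklyn, that column will not exist and the attribute lookup fails. A minimal sketch of one way around this, assuming the sample data from the question, is to reindex the dummy columns to the boroughs you care about. The "Staten Island" entry is a hypothetical addition used only to show a borough that nobody visited.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "User": ["Eminem", "BrSpears", "Elvis", "Adele"],
    "Boroughs_visited": ["Manhattan, Bronx", np.nan, "Brooklyn", "Queens, Brooklyn"],
})

# One indicator column per borough; rows with NaN get all zeros
dummies = df["Boroughs_visited"].str.get_dummies(sep=", ")

# Keep only the requested boroughs; any borough missing from the data
# ("Staten Island" here, a hypothetical example) becomes an all-zero column
wanted = dummies.reindex(columns=["Brooklyn", "Staten Island"], fill_value=0)

result = df.join(wanted)
print(result)
```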