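I have a somewhat unusual problem, and I am mainly hoping to find a way to speed up this code. I have a set of strings stored in a DataFrame, each containing several names, and I know the number of names before this step, like so: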
print df
description                    num_people  people
'Harry ran with sally'         2           []
'Joe was swinging with sally'  2           []
'Lola Dances alone'            1           []
I am using a dictionary whose keys are the names I want to find in the descriptions, like so:
my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Cupid':'1982'}
I then use iterrows to search each string for matches, like so:
for index, row in df.iterrows():
    # write back through df.at; assigning to row only changes a copy
    df.at[index, 'people'] = [key for key in my_dict if re.findall(key, row.description)]
When run, this ends up with:
print df
description                    num_people  people
'Harry ran with sally'         2           ['Harry','Sally']
'Joe was swinging with sally'  2           ['Joe','Sally']
'Lola Dances alone'            1           ['Lola']
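The problem I see is that this code is still quite slow at getting the job done, and I have a large number of descriptions and 1000 keys. Is there a faster way to do this, perhaps by using the number of people found?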
Answer 0 (score: 2)
A faster solution:
# strip the ' at the start and end of each text, then split it into a list of words
splited = df.description.str.strip("'").str.split()
# keep only the words that are keys of my_dict (exact, case-sensitive match)
df['people'] = splited.apply(lambda x: [i for i in x if i in my_dict])
print (df)
   description                    num_people  people
0  'Harry ran with Sally'         2           [Harry, Sally]
1  'Joe was swinging with Sally'  2           [Joe, Sally]
2  'Lola Dances alone'            1           [Lola]
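Note that the filter above is an exact, case-sensitive word match, so the lower-case 'sally' from the question's data would not be picked up by the key 'Sally'. If that matters, one possible variation is to look words up through a lower-cased mapping. This is only a sketch: the `canonical` mapping and the `find_people` helper are names I made up for illustration, and the sample frame is rebuilt from the question's data with the answer's dictionary.

```python
import pandas as pd

# Sample data rebuilt from the question (note the lower-case 'sally').
df = pd.DataFrame({
    "description": [
        "'Harry ran with sally'",
        "'Joe was swinging with sally'",
        "'Lola Dances alone'",
    ],
    "num_people": [2, 2, 1],
})
my_dict = {"Harry": "1283", "Joe": "1828", "Sally": "1298", "Lola": "1982"}

# Hypothetical helper mapping: lower-cased key -> key as spelled in my_dict.
canonical = {key.lower(): key for key in my_dict}

def find_people(text):
    """Return the dictionary keys that occur as whole words, ignoring case."""
    words = text.strip("'").split()
    return [canonical[w.lower()] for w in words if w.lower() in canonical]

df["people"] = df["description"].apply(find_people)
print(df["people"].tolist())
```

Each word still costs one dictionary lookup, so this should scale with the number of words per description rather than with the number of keys.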
Timings:
#[30000 rows x 3 columns]
In [198]: %timeit (orig(my_dict, df))
1 loop, best of 3: 3.63 s per loop
In [199]: %timeit (new(my_dict, df1))
10 loops, best of 3: 78.2 ms per loop
import re
import pandas as pd

# df is the three-row frame shown above; repeat it to get 30000 rows
df['people'] = [[],[],[]]
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Lola':'1982'}

def orig(my_dict, df):
    # original approach: one regex search per key, per row
    for index, row in df.iterrows():
        df.at[index, 'people']=[key for key in my_dict if re.findall(key,row.description)]
    return (df)

def new(my_dict, df):
    # faster approach: split each description once, then filter by dict membership
    df.description = df.description.str.strip("'")
    splited = df.description.str.split()
    df.people = splited.apply(lambda x: [i for i in x if i in my_dict])
    return (df)

print (orig(my_dict, df))
print (new(my_dict, df1))
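One caveat with splitting on whitespace is that a name followed by punctuation (for example "Sally.") no longer equals its key. A possible alternative, which is not part of the timings above and which I have not benchmarked, is to join all keys into a single regular expression with word boundaries and let `Series.str.findall` scan each description once. The sample sentence with the trailing full stop is my own addition for illustration.

```python
import re
import pandas as pd

my_dict = {"Harry": "1283", "Joe": "1828", "Sally": "1298", "Lola": "1982"}

# One alternation over all keys; re.escape guards against keys that
# contain regex metacharacters, and \b restricts matches to whole words.
pattern = r"\b(?:" + "|".join(re.escape(key) for key in my_dict) + r")\b"

# Illustrative data: the first row ends with punctuation on purpose.
df = pd.DataFrame({
    "description": [
        "'Harry ran with Sally.'",
        "'Joe was swinging with Sally'",
        "'Lola Dances alone'",
    ],
})

# findall returns, for each row, the list of matched names in order of appearance.
df["people"] = df["description"].str.findall(pattern)
print(df["people"].tolist())
```

Unlike the loop in the question, this runs one regex pass per description instead of one per key. A name that appears twice in a description will be listed twice, so deduplicate afterwards if that is not wanted.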