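I want to search a DataFrame column called `String` for keywords.
The keywords are stored in a dictionary.
For each key, the value is a list of several keywords.
My problem is that the code below is very slow and takes a long time to run.
It probably involves too many loops, and as far as I can tell df.str.contains cannot be used here.
How can I speed this up?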
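import re

import numpy as np
import pandas as pd


def match(string, keyword):
    """Return 1 if keyword occurs in string as a whole word, else 0."""
    m = len(string)
    n = len(keyword)
    idx = string.find(keyword)  # only the first occurrence is checked
    if idx == -1:
        return 0
    # If the keyword starts with a letter, the character before it must not be a letter.
    if len(re.findall('[a-zA-Z]', string[idx])) > 0:
        if idx > 0:
            if len(re.findall('[a-zA-Z]', string[idx - 1])) > 0:
                return 0
    # If the keyword ends with a letter, the character after it must not be a letter.
    if len(re.findall('[a-zA-Z]', string[idx+n-1])) > 0:
        if idx + n < m:
            if len(re.findall('[a-zA-Z]', string[idx + n])) > 0:
                return 0
    return 1


def match_keyword(df, keyword_dict, name):
    """Build one 0/1 column per dictionary key, named <name>_<key>."""
    df_new = pd.DataFrame()
    for owner_id, keyword in keyword_dict.items():
        try:
            for index, data in df.iterrows():
                # The column name matches the sample input below.
                a = [match(data['String'], word) for word in keyword]
                t = int(np.sum(a))
                if t > 0:
                    df_new.loc[index, name+'_'+str(owner_id)] = 1
                else:
                    df_new.loc[index, name+'_'+str(owner_id)] = 0
        except Exception:
            df_new[name+'_'+str(owner_id)] = 0
    return df_new.astype(int)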
Input:
   String
0  New Beauty Company is now offering 超級discounts
1  Swimming is good for children and adults
2  Children love food though it may not be good
keywords: {'a': ['New', 'is', '超級'], 'b': ['Swim', 'discounts', 'good']}
Result:
   'New'  'is'  '超級'  result (OR)
0      1     1     1            1
1      0     1     0            1
2      0     0     0            0
   'Swim'  'discounts'  'good'  result (OR)
0       0            1       0            1
1       0            0       1            1
2       0            0       1            1
Final result:
   'a'  'b'
0    1    1
1    1    1
2    0    1
Answer 0 (score: 2)
I think you need str.contains in a loop over the dictionary, with word boundaries (\b) around each keyword and | joining them as a regex OR:
for k, v in keywords.items():
    # Wrap each keyword in \b...\b and join the alternatives with |.
    pat = '|'.join(r"\b{}\b".format(x) for x in v)
    # print(pat)
    df[k] = df['String'].str.contains(pat).astype(int)
print(df)
                                         String  a  b
0  New Beauty Company is now offering discounts  1  1
1      Swimming is good for children and adults  1  1
2  Children love food though it may not be good  0  1
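One caveat with `\b`: in Python 3 it treats CJK characters as word characters, so `\bdiscounts\b` does not match inside `超級discounts` from the question's input. A minimal sketch, assuming the goal is the question's own rule (a keyword must not be glued to an ASCII letter on either side), uses lookarounds instead. `keyword_pattern` is a hypothetical helper name, keywords are assumed to be non-empty, and `re.escape` guards against keywords containing regex metacharacters. Unlike the question's `match`, which only inspects the first occurrence, this accepts any occurrence that satisfies the rule.

```python
import re

import pandas as pd


def keyword_pattern(words):
    """Hypothetical helper: OR together keywords with ASCII-letter boundaries."""
    parts = []
    for w in words:
        # Guard an edge only when the keyword itself starts/ends with an ASCII letter.
        left = r"(?<![a-zA-Z])" if re.match(r"[a-zA-Z]", w[0]) else ""
        right = r"(?![a-zA-Z])" if re.match(r"[a-zA-Z]", w[-1]) else ""
        parts.append(left + re.escape(w) + right)
    return "|".join(parts)


# Sample data copied from the question.
df = pd.DataFrame({"String": [
    "New Beauty Company is now offering 超級discounts",
    "Swimming is good for children and adults",
    "Children love food though it may not be good",
]})
keywords = {"a": ["New", "is", "超級"], "b": ["Swim", "discounts", "good"]}

out = pd.DataFrame(index=df.index)
for k, v in keywords.items():
    # One vectorized regex search per dictionary key.
    out[k] = df["String"].str.contains(keyword_pattern(v)).astype(int)
print(out)
```

On the question's three rows this should reproduce the expected final result, including the `超級discounts` case.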
If you need a column for each individual keyword, build a MultiIndex in the columns:
df = df.set_index('String')
for k, v in keywords.items():
    for x in v:
        # Plain substring search here, so 'Swim' also matches 'Swimming'.
        df[(k, x)] = df.index.str.contains(x).astype(int)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
                                              a        b
                                              New  is  Swim  discounts  good
String
New Beauty Company is now offering discounts    1   1     0          1     0
Swimming is good for children and adults        0   1     1          0     1
Children love food though it may not be good    0   0     0          0     1
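The same per-keyword frame can also be assembled in one step. This is only a sketch, assuming the input shown in this answer (without `超級`, to match the printed output): passing a dict of frames to `pd.concat` with `axis=1` puts the dict keys on the top column level, so no separate `MultiIndex.from_tuples` call is needed. `per_keyword` is a name introduced here for illustration.

```python
import re

import pandas as pd

# Assumed input, matching the strings printed in this answer.
df = pd.DataFrame({"String": [
    "New Beauty Company is now offering discounts",
    "Swimming is good for children and adults",
    "Children love food though it may not be good",
]})
keywords = {"a": ["New", "is"], "b": ["Swim", "discounts", "good"]}

s = df["String"]
# One small 0/1 frame per key; the dict keys become the top column level.
per_keyword = pd.concat(
    {
        k: pd.DataFrame({x: s.str.contains(re.escape(x)).astype(int) for x in v})
        for k, v in keywords.items()
    },
    axis=1,
)
# Label the rows by the text itself, as in the printed output.
per_keyword.index = pd.Index(s, name="String")
print(per_keyword)
```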
Then you can take the max over the first level of the MultiIndex to get one column per key:
df = df.max(axis=1, level=0)
print (df)
                                              a  b
String
New Beauty Company is now offering discounts  1  1
Swimming is good for children and adults      1  1
Children love food though it may not be good  0  1
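In recent pandas releases the `level` argument of `DataFrame.max` is no longer available (it was deprecated in 1.3 and removed in 2.0), so `df.max(axis=1, level=0)` raises a `TypeError` there. A sketch of an equivalent reduction, assuming the same two-level columns: transpose, group on the first level, take the max, and transpose back. The small frame below is rebuilt by hand only to keep the example self-contained, and `per_keyword` and `per_key` are names introduced for illustration.

```python
import pandas as pd

# Hand-built stand-in for the per-keyword frame printed above.
cols = pd.MultiIndex.from_tuples(
    [("a", "New"), ("a", "is"), ("b", "Swim"), ("b", "discounts"), ("b", "good")]
)
per_keyword = pd.DataFrame(
    [[1, 1, 0, 1, 0], [0, 1, 1, 0, 1], [0, 0, 0, 0, 1]],
    index=pd.Index(
        [
            "New Beauty Company is now offering discounts",
            "Swimming is good for children and adults",
            "Children love food though it may not be good",
        ],
        name="String",
    ),
    columns=cols,
)

# Group the transposed frame by the first column level, then flip back.
per_key = per_keyword.T.groupby(level=0).max().T
print(per_key)
```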