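Hello, I have two data frames. One is the main one, db1 (it has many rows), and the second is sourcetarget (it is smaller). I want to look for every word from sourcetarget in db1 and, if there is a match, set a new boolean column (0, 1). I tried the code below (which has very high complexity), but I always get 0. What went wrong?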
    start_time = time.time()
    compt = 0
    for i in db1.clean_nomComplet:
        for j in sourcetarget.sourcetarget:
            res0 = i.find(j)
            if res0 >= 0:
                db1['top'] = 1
            else:
                db1['top'] = 0
        compt += 1
        print(compt / len(db1) * 100, end="\r")
        if compt % 50000 == 0:
            print("../data_out/sauve" + str(compt) + '.csv')
            db1.to_csv('../data_out/sauve' + str(compt) + '.csv', encoding='utf-8-sig')
    print("--- %s seconds ---" % (time.time() - start_time))
Answer 0 (score: 1)
The best way I have found to do this kind of comparison is:
    # 1. Convert the values you want to look up into a set: their order
    #    does not matter, and this saves A LOT of complexity.
    source = set(sourcetarget.sourcetarget.values)
    # 2. Use the isin function to flag the rows whose value is in the set
    db1['top'] = 0
    db1.loc[db1['clean_nomComplet'].isin(source), 'top'] = 1
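One caveat: isin tests whole-value equality, while the loop in the question used str.find, which is a substring search. If a substring match is what you need, here is a minimal sketch. It assumes the entries of sourcetarget are plain strings rather than regex patterns, and the sample rows are made up for illustration:

```python
import re
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
db1 = pd.DataFrame(
    {"clean_nomComplet": ["jean dupont", "marie curie", "paul martin", None]}
)
sourcetarget = pd.DataFrame({"sourcetarget": ["dupont", "curie", "c++"]})

# Escape each word so characters such as "+" are matched literally,
# then join all words into a single alternation pattern.
words = sourcetarget["sourcetarget"].dropna().astype(str).unique()
pattern = "|".join(re.escape(w) for w in words)

# str.contains scans every row in one vectorized pass;
# na=False treats missing names as "no match".
db1["top"] = (
    db1["clean_nomComplet"].str.contains(pattern, regex=True, na=False).astype(int)
)

print(db1)
```

This keeps a single pass over db1 instead of a nested Python loop, at the price of building one large pattern from all the words.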
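The problem in your script is that db1['top'] = 1 (or = 0) assigns to the whole column, so every row ends up with the result of the last comparison. To set only the current row, you should use: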
    for index, row in db1.iterrows():
        [...]
        if res0 >= 0:
            db1.loc[index, 'top'] = 1
        else:
            db1.loc[index, 'top'] = 0
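For reference, a complete version of the row-by-row loop might look like the sketch below. It runs on made-up sample data and assumes the intent is to flag a row as soon as any word from sourcetarget occurs in it, which the elided part above does not spell out. Using any() also stops a later non-matching word from overwriting an earlier hit.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
db1 = pd.DataFrame(
    {"clean_nomComplet": ["jean dupont", "marie curie", "paul martin"]}
)
sourcetarget = pd.DataFrame({"sourcetarget": ["dupont", "curie"]})

words = list(sourcetarget["sourcetarget"])

db1["top"] = 0  # create the column once, defaulting to "no match"
for index, row in db1.iterrows():
    name = row["clean_nomComplet"]
    # any() short-circuits on the first word found inside the name
    if any(word in name for word in words):
        db1.loc[index, "top"] = 1  # writes a single cell, not the whole column

print(db1["top"].tolist())
```

This is still a Python-level loop, so on a large db1 it will be much slower than a vectorized approach; it is mainly useful for seeing why the original code always ended with the same value in every row.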