Searching for strings in a pandas DataFrame

Time: 2019-10-22 13:36:32

Tags: python pandas for-loop

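Hello, I have two DataFrames. One is the main one, db1 (it has many rows), and the second is sourcetarget (it is smaller). I want to look for every word from sourcetarget in db1 and, if there is a match, create a new boolean column (0, 1). I tried this code (which has very high complexity), but I always get 0. What went wrong?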

import time

start_time = time.time()

compt=0
for i in db1.clean_nomComplet:
    for j in sourcetarget.sourcetarget:
        res0 = i.find(j)
        if res0 >= 0:     
            db1['top'] = 1
        else:
            db1['top'] = 0
    compt+=1    
    print(compt/len(db1)*100,end="\r")
    if compt%50000 == 0:
        print("../data_out/sauve"+str(compt)+'.csv')
        db1.to_csv('../data_out/sauve'+str(compt)+'.csv', encoding='utf-8-sig')

print("--- %s seconds ---" % (time.time() - start_time))

1 answer:

Answer 0 (score: 1)

The best way I have found to do this kind of comparison is:

# 1. Convert the values you want to check against into a set, since their
#    order does not matter. This makes each lookup much cheaper.
source = set(sourcetarget.sourcetarget.values)

# 2. Use isin, which flags rows whose whole value equals one of the words
#    (an exact match, not a substring search).
db1['top'] = 0
db1.loc[db1['clean_nomComplet'].isin(source), 'top'] = 1
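
Because `isin` compares whole values, it will not flag a name that merely contains one of the words. If substring matching is what is needed, as `i.find(j)` in the question suggests, one possible sketch uses `Series.str.contains` with an alternation built from the escaped words. The sample rows below are made up for illustration; only the column names come from the question.

```python
import re

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
db1 = pd.DataFrame({"clean_nomComplet": ["jean dupont", "marie curie", "paul martin"]})
sourcetarget = pd.DataFrame({"sourcetarget": ["dupont", "curie"]})

# Escape each word so it is matched literally, then join them into one
# alternation pattern: word1|word2|...
pattern = "|".join(re.escape(word) for word in sourcetarget["sourcetarget"])

# str.contains returns a boolean Series; na=False treats missing names as
# "no match". astype(int) turns True/False into the 1/0 column from the question.
db1["top"] = db1["clean_nomComplet"].str.contains(pattern, regex=True, na=False).astype(int)

print(db1)
```

This assumes sourcetarget is non-empty and holds only strings; an empty pattern would match every row.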

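The problem in your script is that `db1['top'] = 1` (or `= 0`) changes the value of the whole column, not just the current row. You should use: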

for index, row in db1.iterrows():
    [...]
    if res0 >= 0:
        db1.loc[index, 'top'] = 1
    else:
        db1.loc[index, 'top'] = 0
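
For reference, here is a minimal runnable sketch of the row-by-row version, under the assumption that a row should be flagged when it contains at least one of the words. It sets the column to 0 once up front and only flips matching rows to 1, because an `else` branch inside the inner loop would overwrite an earlier match as soon as a later word fails to match. The sample rows and the `words` and `name` variables are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
db1 = pd.DataFrame({"clean_nomComplet": ["jean dupont", "marie curie", "paul martin"]})
sourcetarget = pd.DataFrame({"sourcetarget": ["dupont", "curie"]})

words = sourcetarget["sourcetarget"].tolist()

# Default every row to 0, then mark only the rows that match.
db1["top"] = 0
for index, row in db1.iterrows():
    name = row["clean_nomComplet"]
    # any() stops at the first word found inside the name.
    if any(name.find(word) >= 0 for word in words):
        # .loc[index, column] writes a single cell, not the whole column.
        db1.loc[index, "top"] = 1

print(db1)
```

This still loops in Python over every row, so on a large db1 it will be much slower than a vectorized approach.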