I'm having trouble checking for existing data in Scrapy. I'm using Elasticsearch as my database, and this is the code I'm trying to run:
def checkIfURLExistsInCrawler(self, single_url):
    # Search the 'test' index for a document whose url matches single_url
    elastic_query = json.dumps({
        "query": {
            "match_phrase": {
                "url": single_url
            }
        }
    })
    result = es.search(index='test', doc_type='test', body=elastic_query)['hits']['hits']
    return result

def start_requests(self):
    urls = [
        # Some URLs go here. There is a chance that some of them
        # are duplicates, so I need to validate them, but the
        # check does not work inside the for loop.
    ]
    for request_url in urls:
        checkExists = self.checkIfURLExistsInCrawler(request_url)
        if not checkExists:
            beingCrawledUrl = {}
            beingCrawledUrl['url'] = request_url
            beingCrawledUrl['added_on'] = now.strftime("%Y-%m-%d %H:%M:%S")
            json_data = json.dumps(beingCrawledUrl)
            InsertData = es.index(index='test', doc_type='test', body=json.loads(json_data))
            yield scrapy.Request(url=request_url)

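When I run this code, every record in urls = [] gets inserted into the 'test' index, including the duplicates, because the validation above does not work.
However, if I run the program again with the same data, the validation does work. Can anyone help?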
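
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Elasticsearch only makes newly indexed documents searchable after an index refresh (roughly once per second by default), so a search issued right after es.index() inside the same loop may not see the document yet. That would be consistent with the check only working on a second run. Below is a minimal sketch of one workaround that tracks the URLs seen in the current run in a Python set. The names iter_new_urls and exists_in_index are hypothetical, and the lambda is only a stand-in for the Elasticsearch lookup.

```python
from typing import Callable, Iterable, Iterator


def iter_new_urls(
    urls: Iterable[str],
    exists_in_index: Callable[[str], bool],
) -> Iterator[str]:
    """Yield each URL once, skipping URLs already indexed or already seen in this run."""
    seen_in_this_run = set()
    for url in urls:
        if url in seen_in_this_run:
            # Duplicate inside the same batch; the index may not show it yet.
            continue
        seen_in_this_run.add(url)
        if exists_in_index(url):
            # Stored by an earlier run, so it is already searchable.
            continue
        yield url


# Stand-in for the Elasticsearch lookup (assumption): one URL was stored by a previous run.
previously_indexed = {"http://example.com/a"}
batch = ["http://example.com/a", "http://example.com/b", "http://example.com/b"]
new_urls = list(iter_new_urls(batch, lambda u: u in previously_indexed))
print(new_urls)
```

In the spider, start_requests could loop over iter_new_urls(urls, lambda u: bool(self.checkIfURLExistsInCrawler(u))) instead of over urls directly. Passing refresh='wait_for' to es.index() might be another option, at the cost of slower writes.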