Question

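The idea of this spider is to compare a list of keywords from a file against the text in a page, and then extract the matching strings/paragraphs.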

Given the following spider:

import scrapy
import sys
reload(sys)
sys.setdefaultencoding('utf8')

class StringGrab(scrapy.Spider):
    name = "stringpage"
    start_urls = [
        'https://whatscookingamerica.net/Glossary/A.htm',
    ]

    def parse(self, response):
        # Each line of the file is used as one keyword.
        in_file = tuple(open("dictionarA1.csv", "r"))

        for word in in_file:
            # Select the text of every <p> whose content contains the keyword.
            for para_text in response.xpath(u'//p/text()[contains(..,"{0}")]'.format(word)).extract():
                yield {
                    'dictA': para_text,
                }

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 10
    }

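My problem is as stated in the title: the spider crawls and shows status code 200 for every link (when more are added), but it does not write any output to the CSV file.

Edit:

If I hard-code a keyword into the search and extraction, it works: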

'dictA0': word.xpath('//p/text()[contains(..,"a la")]').extract(),
'dictA1':  word.xpath('//p/text()[contains(..,"a la Anglaise")]').extract(),

So it extracts the paragraphs containing "a la" and "a la Anglaise" and writes them to the file.

Scrapy crawls but does not compare against the file and returns null
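One difference between the two cases that may be worth checking (an assumption, not a confirmed diagnosis): iterating over an open file yields each line with its trailing newline, so the keyword formatted into `contains()` would be `"a la\n"` rather than `"a la"`. The sketch below illustrates this without Scrapy, using a plain substring test as a stand-in for XPath `contains()`. The names `load_keywords`, `build_xpath`, `sample_file` and `paragraph` are hypothetical, and the one-keyword-per-line layout of dictionarA1.csv is assumed.

```python
import io


def load_keywords(lines):
    """Return the non-empty keywords with surrounding whitespace removed."""
    return [line.strip() for line in lines if line.strip()]


def build_xpath(keyword):
    # Same expression as in the spider, built from a cleaned keyword.
    return u'//p/text()[contains(..,"{0}")]'.format(keyword)


# Stand-in for dictionarA1.csv (assumed layout: one keyword per line).
sample_file = io.StringIO(u"a la\na la Anglaise\n")
raw = tuple(sample_file)  # same pattern as tuple(open(...)) in the spider

# Hypothetical paragraph text, used only for this illustration.
paragraph = u"The term a la Anglaise means prepared in the English style."

for word in raw:
    # The raw line still carries its newline, so the substring test fails.
    print(repr(word), word in paragraph)

for word in load_keywords(raw):
    # After strip() the keyword is found in the paragraph.
    print(repr(word), word in paragraph, build_xpath(word))
```

If this is indeed the cause, calling `word.strip()` before formatting the XPath in `parse` would be the corresponding change. A keyword that itself contains a double quote would still need separate handling.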

0 answers: