Question

I have some JSON files that each contain a "links" property. Here is an example of one of these files:

{
"links": [
  "https://lastsecond.ir/hotels/1343-metropol-ankara",
  "https://lastsecond.ir/hotels/1347-bianco-boutique"
],
"names": [
  "Metropol Ankara hotel",
  "Bianco Boutique hotel",
  "Asal Ankara hotel",
  .
  .
  .
}

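I need to read all of these files and, for each link, crawl the page and run the item pipeline. For files that contain only one link, the item pipeline works correctly, but for files with multiple links, the pipeline output only reflects the last link in the "links" property of the JSON file. Here is my spider code so far: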

import json
from datetime import datetime
from os import listdir
from os.path import isfile, join

import scrapy
from scrapy.loader import ItemLoader

# Assumption: tourItem lives in the project's items module.
from lastsecond.items import tourItem


class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']
    custom_settings = {
        'ITEM_PIPELINES': {
            'lastsecond.pipelines.hotelFile': 400
        }
    }

    def start_requests(self):
        files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

        # One request per link; the source file name travels along in meta.
        for file in files:
            with open('lastsecond/hotels/' + file, 'r') as hotel_info:
                hotel = json.load(hotel_info)
                for link in hotel["links"]:
                    yield scrapy.Request(link, meta={'id': file})

    def parse(self, response):
        tour = ItemLoader(item=tourItem(), response=response)
        tour.add_css('name', '.tours-list h5 a::text')
        tour.add_css('nights', '.tours-list ul.mx-1 li:last-child label::text')
        tour.add_value('found_date', str(datetime.now()))
        tour.add_value('id', response.meta['id'])
        yield tour.load_item()

Here is my pipeline code:

import json


class hotelFile(object):
    def process_item(self, item, spider):
        # ItemLoader stores each field as a list, hence item['id'][0].
        with open('lastsecond/results/' + item['id'][0], 'w') as result:
            json.dump(dict(item), result)
        return item

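I also have a second problem: in the output files I only see the item fields that I assigned with add_value. None of the fields that I assigned with add_css appear in the output file. These are my two problems with this code.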

Answer 1


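You are reopening the file for writing every time an item is processed, which overwrites its previous contents. You should either open the file for appending or open it only once.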
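As a minimal sketch of the append approach, the pipeline could look something like the following. The class name AppendingHotelFile and the results_dir parameter are hypothetical and not part of the original project. The sketch also assumes that writing one JSON object per line is acceptable for the result files.

```python
import json
import os


class AppendingHotelFile:
    """Hypothetical variant of the hotelFile pipeline: one JSON object per line."""

    def __init__(self, results_dir='lastsecond/results/'):
        # results_dir is an assumed parameter; the default matches the question's path.
        self.results_dir = results_dir

    def process_item(self, item, spider):
        # ItemLoader stores each field as a list, hence item['id'][0] as in the question.
        path = os.path.join(self.results_dir, item['id'][0])
        # Mode 'a' appends, so items scraped from earlier links are not overwritten.
        with open(path, 'a') as result:
            result.write(json.dumps(dict(item)) + '\n')
        return item
```

Note that with append mode the result file becomes a sequence of JSON lines rather than a single JSON document, and re-running the crawl adds to whatever is already in the file unless the old results are removed first.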

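As for your second problem, your code looks like it should work.
Make sure the CSS selectors actually match the content you are trying to extract (the second one in particular looks very specific).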

Item pipeline only executes for the last URL in start_requests
