Scrapy spider only outputs the last item when given a list of selectors

Date: 2017-10-19 17:55:04

Tags: python xpath scrapy scrapy-spider

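I've run into a problem with a spider I put together. I'm trying to scrape the individual lines of text, along with their corresponding timestamps, from the transcript on this site, and I found what I believe are the appropriate selectors. When I run the spider, however, the output is only the last line and its timestamp. I've seen a few other people with similar issues, but I haven't found an answer that solves my problem.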

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from this_american_life.items import TalTranscriptItem

class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item

Here is the code for TalTranscriptItem() in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TalTranscriptItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    episode_id = scrapy.Field()
    episode_num_text = scrapy.Field()
    year = scrapy.Field()
    radio_date_text = scrapy.Field()
    radio_date_datetime = scrapy.Field()
    episode_title = scrapy.Field()
    episode_hosts = scrapy.Field()
    act_id = scrapy.Field()
    line_id = scrapy.Field()
    begin_timestamp = scrapy.Field()
    speaker_class = scrapy.Field()
    speaker_name = scrapy.Field()
    line_text = scrapy.Field()
    full_audio_link = scrapy.Field()
    transcript_url = scrapy.Field()

When I run this in scrapy shell it seems to work correctly (it pulls all of the lines of text), but for some reason I can't get it to work in the spider.

I'm happy to clarify any of this, and I'd greatly appreciate any help anyone can offer!

2 Answers:

Answer 0: (score: 1)

If you want each individual line to be yielded as its own item, I think this is what you want (note the indentation of the final yield line):

yield
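
A minimal sketch of the difference the placement of `yield` makes, using only the standard library rather than Scrapy so it runs anywhere. The sample markup, the `begin` attribute on `<p>`, and both function names are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature transcript; the real page's markup is assumed, not verified.
SAMPLE = """<div>
<p begin="00:00:00.17">First line.</p>
<p begin="00:00:05.40">Second line.</p>
<p begin="00:00:09.02">Third line.</p>
</div>"""


def parse_yield_outside(root):
    """Mirrors the question: one shared item, yielded once after the loop."""
    item = {}
    for p in root.iter("p"):
        item["begin_timestamp"] = p.get("begin")
        item["line_text"] = "".join(p.itertext())
    yield item  # runs once, so only the last <p> survives


def parse_yield_inside(root):
    """Builds one fresh item per <p> and yields it inside the loop."""
    for p in root.iter("p"):
        item = {
            "begin_timestamp": p.get("begin"),
            "line_text": "".join(p.itertext()),
        }
        yield item


root = ET.fromstring(SAMPLE)
outside = list(parse_yield_outside(root))
inside = list(parse_yield_inside(root))
print(len(outside), len(inside))  # prints: 1 3
```

In the spider itself, the same shape would mean creating `TalTranscriptItem()` inside the `for` loop and indenting `yield item` to match. The XPaths inside the loop would also need a leading dot (for example `line.xpath('.//text()')`), since in Scrapy an expression starting with `//` searches the whole document even when it is called on a nested selector.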

Answer 1: (score: 0)

I'm not sure about the item class, but you could do this instead:

items = []

for line in response.xpath('//p'):
    # Relative XPaths (leading '.') keep each query scoped to the current <p>
    dict_item = {
        'begin_timestamp': line.xpath('.//@begin').extract(),
        'line_text': line.xpath('.//text()').extract(),
    }
    items.append(dict_item)

print(items)
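
A small sketch of why building a new dict on each pass matters here, assuming made-up rows in place of the scraped `<p>` elements. Reusing a single dict and appending it repeatedly leaves every list entry pointing at the same object:

```python
# Hypothetical (timestamp, text) pairs standing in for the scraped <p> elements.
rows = [("00:00:00.17", "First line."), ("00:00:05.40", "Second line.")]

# Reusing one dict: every list entry is the same object, so all show the last row.
shared = {}
aliased = []
for begin, text in rows:
    shared["begin_timestamp"] = begin
    shared["line_text"] = text
    aliased.append(shared)

# Building a new dict per row keeps each line separate.
separate = []
for begin, text in rows:
    separate.append({"begin_timestamp": begin, "line_text": text})

print(aliased[0] is aliased[1])             # prints: True
print([d["line_text"] for d in aliased])    # prints: ['Second line.', 'Second line.']
print([d["line_text"] for d in separate])   # prints: ['First line.', 'Second line.']
```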