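I've run into a problem with a spider I put together. I'm trying to scrape the individual lines of text and their corresponding timestamps from the transcript on this site, and I've found what I believe are the appropriate selectors, but when I run the spider its output is only the last line and timestamp. I've seen a few other people with similar problems, but I haven't found an answer that resolves mine.
Here is the spider: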
# -*- coding: utf-8 -*-
import scrapy
from this_american_life.items import TalTranscriptItem

class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item
Here is the code for TalTranscriptItem() in items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class TalTranscriptItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    episode_id = scrapy.Field()
    episode_num_text = scrapy.Field()
    year = scrapy.Field()
    radio_date_text = scrapy.Field()
    radio_date_datetime = scrapy.Field()
    episode_title = scrapy.Field()
    episode_hosts = scrapy.Field()
    act_id = scrapy.Field()
    line_id = scrapy.Field()
    begin_timestamp = scrapy.Field()
    speaker_class = scrapy.Field()
    speaker_name = scrapy.Field()
    line_text = scrapy.Field()
    full_audio_link = scrapy.Field()
    transcript_url = scrapy.Field()
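When I run this in the scrapy shell, it seems to work correctly (it pulls all the lines of text), but for some reason I can't get it to work in the spider.
I'm happy to clarify any of this, and I'd greatly appreciate any help anyone can offer!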
Answer 0: (score: 1)
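If you want each individual line to be yielded as its own item, I think what you want is to move the yield item line inside the for loop (note the indentation of that final yield line). With yield at the same level as the for statement, it runs only once, after the loop has finished, so only the last line comes out.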
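To see why the placement of yield matters, here is a minimal sketch in plain Python rather than Scrapy itself. The LINES data and both function names are made up for illustration, and plain dicts stand in for TalTranscriptItem.

```python
# Stand-in for response.xpath('//p'): (timestamp, text) pairs (made-up data).
LINES = [
    ("00:00:00", "first line"),
    ("00:00:05", "second line"),
    ("00:00:09", "third line"),
]


def parse_yield_after_loop(lines):
    """Mirrors the question: one shared item, yield outside the loop."""
    item = {}
    for begin, text in lines:
        item["begin_timestamp"] = begin
        item["line_text"] = text
    yield item  # runs once, after the loop has overwritten the fields


def parse_yield_in_loop(lines):
    """Mirrors the fix: a fresh item per line, yield inside the loop."""
    for begin, text in lines:
        item = {"begin_timestamp": begin, "line_text": text}
        yield item  # runs once per line


after_loop = list(parse_yield_after_loop(LINES))
in_loop = list(parse_yield_in_loop(LINES))
print(len(after_loop), after_loop[0]["line_text"])
print(len(in_loop), [i["line_text"] for i in in_loop])
```

The first generator produces a single item holding the last line, which matches the symptom described in the question; the second produces one item per line. Creating the item inside the loop also avoids handing the same mutable object to the pipeline several times.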
Answer 1: (score: 0)
I'm not sure about the Item class, but here is what you could do with plain dicts instead:
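# Collect one plain dict per <p> element instead of using an Item
item = []
for line in response.xpath('//p'):
    dictItem = {
        # './' and './/' keep each query relative to the current <p>
        'begin_timestamp': line.xpath('./@begin').extract(),
        'line_text': line.xpath('.//text()').extract(),
    }
    item.append(dictItem)
print(item)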
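One further point about the selectors, offered as a sketch rather than a tested fix for this site: in Scrapy, an XPath that begins with // is evaluated against the whole document even when it is called on a nested selector, so inside the loop the relative forms ./@begin and .//text() are the ones that stay within the current <p>. The example below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, with a made-up two-line transcript, to show per-element extraction next to document-wide extraction. The markup in SAMPLE is an assumption about how the page is structured.

```python
import xml.etree.ElementTree as ET

# Made-up markup that assumes each transcript line is a <p begin="..."> element.
SAMPLE = """
<div>
  <p begin="00:00:00">Hello <i>and</i> welcome.</p>
  <p begin="00:00:05">Act one.</p>
</div>
"""

root = ET.fromstring(SAMPLE)

# Document-wide: every timestamp at once, regardless of the current line.
all_begins = [p.get("begin") for p in root.iter("p")]

# Per element: only the attribute and text that belong to each <p>.
rows = []
for p in root.iter("p"):
    rows.append({
        "begin_timestamp": p.get("begin"),
        "line_text": "".join(p.itertext()),
    })

print(all_begins)
for row in rows:
    print(row)
```

The document-wide list is what every iteration would see if the query ignored the current element, while each entry in rows carries exactly one timestamp and one line of text.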