I'm building a web scraper with Scrapy that just grabs all the reddit links from the front page. When I try to output it to a JSON file, all I get is '['.
Here is my spider:
from scrapy import Spider
from scrapy.selector import Selector

from redditScrape.items import RedditscrapeItem


class RedditSpider(Spider):
    name = "redditScrape"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/all"
    ]

    def parse(self, response):
        titles = Selector(response).xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]')
        for title in titles:
            item = RedditscrapeItem()
            item['title'] = title.xpath('/a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract()
            yield item
Whenever I run the XPath query in the Google Chrome console, I get the results I'm looking for.
Any idea why my scraper's output is wrong?
Here is the command I use to run it:
scrapy crawl redditScrape -o items.json -t json
Answer 0 (score: 1)
I don't know exactly what the problem is, but I can see some errors in your code.

First, I'm not sure what the -t argument does, but I suspect you want to make sure the output is a JSON file. You don't need it; -o items.json is enough:

scrapy crawl redditScrape -o items.json

You also don't need to instantiate Selector; you can simply write titles = response.xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]'). This isn't an error, just a quality-of-life improvement.

Second, the XPath is broken, to say the least: the leading slash in '/a[...]' makes it an absolute path from the document root, so it matches nothing relative to each title. Make it relative, and use extract_first() to get a string instead of a one-element list:

item['title'] = title.xpath('a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract_first()

Whenever an item is successfully yielded, Scrapy appends it to the output file at runtime.
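Putting those fixes together, here is a minimal sketch of the corrected parse method (keeping your original class-based selectors, and assuming the same RedditscrapeItem from your import):

def parse(self, response):
    # response.xpath() works directly; no explicit Selector needed
    titles = response.xpath('//div[@class="entry unvoted lcTagged"]/p[@class="title"]')
    for title in titles:
        item = RedditscrapeItem()
        # relative XPath (no leading slash) searches inside each matched title node
        item['title'] = title.xpath('a[@class="title may-blank loggedin srTagged imgScanned"]/text()').extract_first()
        yield item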
Edit:

You can get all the titles from the front page with just this XPath: //p[@class="title"]/a/text(). (Reddit's markup puts many varying classes on those divs, so exact @class="..." matches like the ones above are likely fragile.) In your code, it would look like this:
for title in response.xpath('//p[@class="title"]/a'):
    item = RedditscrapeItem()
    item['title'] = title.xpath('text()').extract_first()
    yield item
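With items actually being yielded, items.json will contain a JSON array of objects instead of the lone '[', along these lines (titles will vary):

[
    {"title": "Dog thinks he has a bunch of friends"},
    {"title": "Zarya ultimate chain kill"}
]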
Answer 1 (score: 1)
This CSS selector will get all the titles:
In [13]: response.css("a.title.may-blank::text").extract()
Out[13]:
[u'TIL of a millionaire who announced he would bury his Bentley for his afterlife. After lots of negative reaction, he revealed the publicity stunt about organ donations. "People bury things that are much more valuable then cars and nobody seems to care".',
u'Dog thinks he has a bunch of friends',
u'Sewage leak at a movie theater. Looks like black tile.',
u'3:48 am "Hydraulic Press"',
u'I told her it was for their protection...',
u'Long visits to nature linked to improved mental health, study finds',
u"Vladimir Putin Says Brexit Caused by British Politicians 'Arrogance'",
u"World's smallest man dancing with his pet cat. 26th October 1956.",
u'I am Sue Sullivan and Reddit saved my sauce and rub company, Hot Squeeze. Tomorrow, I\u2019m heading to Wal-Mart for my last, big pitch for distribution. Whatever happens, I wanted to say thank you for all your support and AMA! Helping me out with this AMA will be Lawrence Wu, the founder WUJU hot sauce!',
u"Cartoons made me think dog catchers were super common, but now I'm pretty sure they don't even exist",
u'Zarya ultimate chain kill',
u'Shaqiri scores vs Poland to make it 1-1',
u'Mythbusters, during their later seasons',
u"'Why not Texit?': Texas nationalists look to the Brexit vote for inspiration",
u'Ken M on Hitler',
u'Skill, pure skill',
u'My girlfriend paints things. This is a pair of Vans she is currently working on.',
u'I made a magnet wall to display my PS4 steelbook game collection!',
u'HuffPo in 2008: "Muslims appear to be far more concerned about perceived slights to their religion than about atrocities committed daily in its name"',
u"It's been almost 3 years since the removal of the Rose block. Never forget.",
u"Xherdan Shaqiri's insane bicycle kick goal vs. Poland",
u"US Customs wants to collect social media account names at the border: 'Please enter information associated with your online presence'",
u'How was the cameraman for Finding Dory able to hold his breath for the entire filming?',
u'Star Guardian Urgot',
u'I made some doorstops! (Not as lame as it sounds)']
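Sessions like the one above are typically captured in Scrapy's interactive shell; if you want to experiment with the selector yourself, you can open the page the same way (assuming Scrapy is installed):

scrapy shell "https://www.reddit.com/r/all"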
To populate the items, your parse method then only needs:

for text in response.css("a.title.may-blank::text").extract():
    # create a fresh item for each title
    item = RedditscrapeItem()
    item['title'] = text
    yield item
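For completeness, a sketch of the whole spider rewritten around the CSS selector (assuming the same RedditscrapeItem with a title field):

from scrapy import Spider

from redditScrape.items import RedditscrapeItem


class RedditSpider(Spider):
    name = "redditScrape"
    allowed_domains = ["reddit.com"]
    start_urls = ["https://www.reddit.com/r/all"]

    def parse(self, response):
        # ::text extracts the text nodes of each matched <a class="title may-blank"> link
        for text in response.css("a.title.may-blank::text").extract():
            item = RedditscrapeItem()
            item['title'] = text
            yield item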