Question

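I am new to Scrapy and am trying to use it to practice crawling a website. However, even though I followed the code provided in the tutorial, it returns no results. It looks like yield scrapy.Request does not work. My code is as follows: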

import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem

class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls =['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem

It shows that the spider was opened and closed, but it returns nothing. The Python version is 3.6. Can anyone help? Thanks.

Edit I

The crawl log can be found here.

Edit II

If I change the code as below, it may make the problem clearer:

import scrapy
from bs4 import BeautifulSoup


class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)

The code should print the URL and the title separately, but it returns nothing.

Answer 1

Your log states:

2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': <GET http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>

Your spider is set to:

allowed_domains = ['appledaily.com']

So it should be:

allowed_domains = ['appledaily.com.tw']
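
To see why the original setting causes these requests to be filtered, here is a rough standard-library sketch of the kind of host matching the offsite filter performs. The helper `is_allowed` is hypothetical and only approximates the behavior (a host is accepted when it equals an allowed domain or is a subdomain of one); it is not Scrapy's actual implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: approximate the offsite check by host suffix."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


url = "http://www.appledaily.com.tw/realtimenews/section/new/"

# 'www.appledaily.com.tw' is not 'appledaily.com' and does not end with '.appledaily.com'.
print(is_allowed(url, ["appledaily.com"]))     # False
# It does end with '.appledaily.com.tw', so the request would pass.
print(is_allowed(url, ["appledaily.com.tw"]))  # True
```

The host ends in `.com.tw`, so `appledaily.com` is neither the host itself nor a parent domain of it, which is consistent with the "Filtered offsite request" line in the log.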

Answer 2

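The content you are interested in within the parse method (i.e. the list items with class rtddt) appears to be generated dynamically: it can be inspected using Chrome, for example, but it is not present in the HTML source (which is the response Scrapy gets).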
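
One way to check this claim is to count how many elements with the class appear in the raw HTML, before any JavaScript runs. The following is a minimal sketch using only the standard library; `ClassCounter`, `count_class`, and the two HTML strings are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Hypothetical helper: count start tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# Made-up stand-ins for a static page and for one that is filled in by JavaScript.
static_html = '<ul><li class="rtddt"><a href="/a">A</a></li></ul>'
dynamic_html = '<ul id="list"></ul><script>/* fills #list later */</script>'

print(count_class(static_html, "rtddt"))   # 1
print(count_class(dynamic_html, "rtddt"))  # 0
```

In a spider, the same check could be run on `response.text` inside `parse`. A count of 0 there, while the browser's inspector shows the elements, would suggest the list is rendered client-side.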

You will have to use something to render the page for Scrapy first. I would recommend Splash together with the scrapy-splash package.

yield scrapy.Request does not return the title
