Question

I have the following code, which only partially works:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=("//a[contains(text(), 'Next Page')]")
        ),
            callback='parse_item',
            process_request='start_requests',
            follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass

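The code runs for start_urls only and does not follow the links specified in restrict_xpaths. If I comment out the start_requests() method and the process_request='start_requests', line in the rules, it runs and follows the links as intended, but of course without JS rendering.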

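I have read the two related questions, "CrawlSpider with Splash getting stuck after first URL" and "CrawlSpider with Splash", and specifically changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code? Thanks.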

Answer 1

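I had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider: it would visit only the start URL and then close. The only way I managed to get it working was to not use the scrapy-splash plugin at all and instead use the process_links method to prepend the Splash HTTP API URL to every link Scrapy collects. I then made a few other adjustments to compensate for the new problems this approach creates. Here is what I did.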

You'll need these two tools to put the Splash URL together, and to take it apart again if you intend to store the original URL somewhere.

from urllib.parse import urlencode, parse_qs

With the Splash URL prepended to every link, Scrapy filters them all out as offsite requests, so we make 'localhost' the allowed domain.

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']
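
To see why the offsite filter comes into play, it can help to compare the host name of a link before and after the Splash endpoint is prepended. The following is a minimal sketch using only the standard library. It assumes Splash is listening on localhost:8050, as elsewhere in this answer, and the page URL is made up for illustration.

```python
from urllib.parse import urlencode, urlparse

# Assumed Splash endpoint; adjust host and port to your own setup.
SPLASH_RENDER = "http://localhost:8050/render.html?&"

# Hypothetical page URL, used only for illustration.
page_url = "https://www.example.com/diy?page=2"
splash_url = SPLASH_RENDER + urlencode({"url": page_url, "wait": 2.0})

# A domain check only sees the host of the request it is given,
# which is now the Splash server rather than the target site.
original_host = urlparse(page_url).hostname
rewritten_host = urlparse(splash_url).hostname
print(original_host, "->", rewritten_host)
```

Every rewritten request therefore appears to target localhost, whatever site the page itself lives on.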

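However, this creates a problem: we may end up crawling the web endlessly when we only want to crawl one site. Let's fix that with the LinkExtractor rule. By extracting only links from the domain we want, we get around the offsite request problem.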

LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
process_links='process_links',
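
A quick way to check what this pattern lets through is to try it on a few URLs with the re module. This is only a sketch: the URLs are made up, and it uses re.search on the assumption that the extractor accepts a link when the pattern is found anywhere in its URL.

```python
import re

# Same pattern as in the rule above, built for the domain example.com.
pattern = re.compile(r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com'))

# Hypothetical URLs, used only for illustration.
candidates = [
    "https://www.example.com/thread/1",
    "http://example.com/",
    "https://other.org/page",
    "https://other.org/?ref=example.com",
]

# Keep the URLs in which the pattern is found somewhere.
matched = [url for url in candidates if pattern.search(url)]
for url in matched:
    print(url)
```

Because every prefix group in the pattern is optional, a URL on another site that merely mentions the domain (the last candidate) also matches under this assumption, so a stricter pattern may be worth considering if that matters for your crawl.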

Here is the process_links method. The dictionary passed to urlencode is where you put all of your Splash arguments.

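def process_links(self, links):
    # Route every extracted link through the Splash HTTP API.
    splash_prefix = "http://localhost:8050/render.html?&"
    for link in links:
        # Leave links alone if they already point at Splash.
        if splash_prefix not in link.url:
            link.url = splash_prefix + urlencode({'url': link.url, 'wait': 2.0})
    return links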
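
The method above can be tried outside of Scrapy to see what the rewritten links look like. The sketch below restates the same logic as a plain function and feeds it a hypothetical FakeLink class standing in for Scrapy's link objects (only the url attribute is used). The Splash address and the page URL are assumptions for illustration.

```python
from urllib.parse import urlencode

SPLASH_RENDER = "http://localhost:8050/render.html?&"  # assumed Splash endpoint


class FakeLink:
    """Hypothetical stand-in for a Scrapy link; only `url` is needed here."""

    def __init__(self, url):
        self.url = url


def process_links(links):
    # Same logic as the spider method above, written as a plain function.
    for link in links:
        if SPLASH_RENDER not in link.url:
            link.url = SPLASH_RENDER + urlencode({"url": link.url, "wait": 2.0})
    return links


links = process_links([FakeLink("https://www.example.com/diy?page=2")])
first_pass = links[0].url
# Running it again leaves an already rewritten link untouched.
second_pass = process_links(links)[0].url
print(first_pass)
```

The original URL ends up percent-encoded inside the url parameter, and the membership check keeps a link from being wrapped twice.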

Finally, to get the original URL back out of the Splash URL, use the parse_qs method.

parse_qs(response.url)['url'][0]

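One last note on this approach. You'll notice the '&' right at the start of the query string in the Splash URL (...render.html?&). It makes parsing the Splash URL to extract the actual URL consistent, no matter what order the parameters are in when you use urlencode.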
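
The effect of that extra '&' can be checked with the standard library alone. The sketch below builds the Splash URL both ways and compares the keys parse_qs returns when it is handed the whole URL, as in the snippet above. The page URL is hypothetical, and the urlparse variant at the end is an alternative added here for comparison, not part of the approach described in this answer.

```python
from urllib.parse import parse_qs, urlencode, urlparse

query = urlencode({"url": "https://www.example.com/diy?page=2", "wait": 2.0})

with_amp = "http://localhost:8050/render.html?&" + query
without_amp = "http://localhost:8050/render.html?" + query

# parse_qs splits on '&' and '='. With the extra '&', the endpoint part forms
# a chunk of its own that has no '=', so it is dropped and 'url' is a clean key.
keys_with_amp = sorted(parse_qs(with_amp))
# Without it, the endpoint is glued onto the name of the first parameter.
keys_without_amp = sorted(parse_qs(without_amp))

# Alternative that does not rely on the extra '&': parse only the query part.
recovered = parse_qs(urlparse(without_amp).query)["url"][0]

print(keys_with_amp)
print(keys_without_amp)
print(recovered)
```

Without the '&', whichever parameter happens to come first gets the endpoint prefixed to its name, which is why the lookup of 'url' would depend on parameter order.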

Answer 2

This seems to be related to https://github.com/scrapy-plugins/scrapy-splash/issues/92

Personally, I use dont_process_response=True so that the response is an HtmlResponse (which the code in _requests_to_follow requires).

I also redefine the _build_request method in my spider, like so:

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded, args={'wait': 0.5}, dont_process_response=True)
    r.meta.update(rule=rule, link_text=link.text)
    return r

In the GitHub issue, some users simply redefine the _requests_to_follow method in their class.

Answer 3

Use the code below (just copy and paste):

restrict_xpaths=('//a[contains(text(), "Next Page")]')

instead of

restrict_xpaths=("//a[contains(text(), 'Next Page')]")

Scrapy CrawlSpider + Splash: how to follow links through linkextractor?
