Scrapy / LxmlLinkExtractor keeps appending the same URL path over and over

Asked: 2018-02-19 05:06:10

Tags: python scrapy

When I try to extract links from a web page, I get URLs like the one below. Scrapy / LxmlLinkExtractor keeps appending part of the URL path again and again, seemingly without end. How can I fix this?

' http://blogs.wsj.com/metropolis/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/tab/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/tab/2011/01/25/2011/01/25/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/tab/2011/01/25/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/tab/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/tab/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/tab/2011/01/25/2011/01/26/weather-journal-more-snow-than-expected-in-new-york-area/ ....'

I am using scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor to extract the links.

lxml_link_extractor = LxmlLinkExtractor(allow_domains=['wsj.com'])
lxml_link_extractor.extract_links(response)  # response is the standard Response returned to the spider from the middleware
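This kind of runaway URL usually appears when a page contains relative hrefs (or a bad `<base>` tag) that get re-joined onto the current page's path on every crawl depth, so each hop appends the same segments again. One way to keep such links out of the crawl is to discard any URL whose path contains a repeated segment. The helper below is a minimal, self-contained sketch of that heuristic; the function name `looks_like_looping_url` is my own, not part of Scrapy:

```python
from collections import Counter
from urllib.parse import urlparse

def looks_like_looping_url(url: str) -> bool:
    """Heuristic: True if any path segment occurs more than once,
    a symptom of relative links being re-joined onto the page path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    counts = Counter(segments)
    return any(c > 1 for c in counts.values())
```

If the heuristic is acceptable for the target site (note it would also drop legitimately repetitive paths), it could be plugged into the extractor via its `process_value` callback, e.g. `LxmlLinkExtractor(allow_domains=['wsj.com'], process_value=lambda v: None if looks_like_looping_url(v) else v)`, since returning `None` from `process_value` discards the link. That filters the symptom; the underlying cause is still worth checking in the page's `<base>` tag and relative hrefs.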

0 Answers:

There are no answers yet.