Question

I am doing web crawling with Scrapy. At the moment it fetches the start URL, but it does not crawl any further from there.

start_urls = ['https://cloud.cubecontentgovernance.com/retention/document_types.aspx']

allowed_domains = ['cubecontentgovernance.com']
rules = (
    Rule(LinkExtractor(allow=("document_type_retention.aspx?dtid=1054456",)),
         callback='parse_item', follow=True),
)

The link I want to extract, as it appears in the browser's developer tools, is: <a id="ctl00_body_ListView1_ctrl0_hyperNameLink" href="document_type_retention.aspx?dtid=1054456"> pricing </a>

The corresponding URL is https://cloud.cubecontentgovernance.com/retention/document_type_retention.aspx?dtid=1054456

So what should the allow field be? Many thanks.

Answer 1

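When I try to open your start URL, I get a login window.

Have you tried simply printing response.body in the parse method for the start URL? My guess is that your Scrapy instance receives the same login page, which does not contain the URL you want to extract with LinkExtractor.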
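Both possible causes can be checked offline with a rough, standard-library-only sketch. It collects the links in a response body, resolves them against the start URL, and tests them against an allow pattern. LinkExtractor treats allow entries as regular expressions, so the unescaped "." and "?" in the question's pattern may be a second problem: "x?" makes the "x" optional instead of matching a literal question mark. The names HrefCollector and matching_links, and the two sample page bodies, are hypothetical and only illustrate the idea. re.search stands in for the matching LinkExtractor does, which is an assumption here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

START_URL = "https://cloud.cubecontentgovernance.com/retention/document_types.aspx"

# Pattern as written in the question: "." matches any character and "x?"
# makes the "x" optional, so the literal "?" in the URL is never matched.
UNESCAPED = r"document_type_retention.aspx?dtid=1054456"
# Escaped variant: "\." and "\?" match the literal characters.
ESCAPED = r"document_type_retention\.aspx\?dtid=1054456"


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matching_links(html, pattern, base_url=START_URL):
    """Return absolute URLs of links in html whose URL matches pattern."""
    collector = HrefCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    return [url for url in absolute if re.search(pattern, url)]


# Hypothetical response bodies: one with the link from the question,
# one standing in for a login page that has no such link.
LISTING_PAGE = (
    '<a id="ctl00_body_ListView1_ctrl0_hyperNameLink" '
    'href="document_type_retention.aspx?dtid=1054456"> pricing </a>'
)
LOGIN_PAGE = '<form action="login.aspx"><input name="user"></form>'

if __name__ == "__main__":
    print(matching_links(LISTING_PAGE, UNESCAPED))
    print(matching_links(LISTING_PAGE, ESCAPED))
    print(matching_links(LOGIN_PAGE, ESCAPED))
```

In a CrawlSpider, the same check could be run on response.text inside parse_start_url, since CrawlSpider uses parse internally. If the body turns out to be the login page, no allow pattern will help until the spider authenticates first.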
