How do I scrape links from all of my web pages?

Date: 2018-04-12 17:34:14

Tags: python scrapy screen-scraping

So far I have this code, which uses Scrapy to extract the text from a page URL:

(code snippet not preserved in the original post)

How can I extract data from the links on those pages and write it to the files I'm creating?

1 Answer:

Answer 0 (score: 1)

You can use a CrawlSpider to extract every link and crawl each one; your code could look something like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QuotesSpider(CrawlSpider):
    name = "dialpad"

    start_urls = [
        'https://help.dialpad.com/hc/en-us/categories/201278063-User-Support',
        'https://www.domo.com/',
        'https://www.zenreach.com/',
        'https://www.trendkite.com/',
        'https://peloton.com/',
        'https://ting.com/',
        'https://www.cedar.com/',
        'https://tophat.com/',
        'https://www.bambora.com/en/ca/',
        'https://www.hoteltonight.com/'
    ]

    rules = [
        Rule(
            # Follow links whose URLs match `allow` and skip those matching
            # `deny`; every matched page is handed to `parse_item`
            LinkExtractor(
                allow=(r'url patterns here to follow'),
                deny=(r'other url patterns to deny'),
            ),
            callback='parse_item',
            follow=True,
        )
    ]

    def parse_item(self, response):
        # Use the domain portion of the URL to build a per-site filename
        page = response.url.split("/")[2]
        filename = 'quotes-thing-{}.csv'.format(page)

        # Note: 'w' overwrites the file for every page crawled on the same
        # domain; use 'a' instead if you want to accumulate text across pages
        with open(filename, 'w') as f:
            for selector in response.css('body').xpath('.//text()'):
                f.write(selector.extract())

That said, I'd recommend creating a separate spider for each website, and using the allow and deny parameters to select which links you want to extract on each site, as sketched below.
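For example, a single-site spider for the Dialpad help center could look like the following sketch; the allow and deny patterns here are hypothetical placeholders, so check the site's actual URL structure before using them:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DialpadSpider(CrawlSpider):
    name = "dialpad_help"
    # Restrict the crawl to a single site
    allowed_domains = ["help.dialpad.com"]
    start_urls = [
        'https://help.dialpad.com/hc/en-us/categories/201278063-User-Support',
    ]

    rules = [
        Rule(
            LinkExtractor(
                # Hypothetical pattern: follow article pages only ...
                allow=(r'/hc/en-us/articles/',),
                # ... and never follow sign-in pages
                deny=(r'/hc/en-us/signin',),
            ),
            callback='parse_item',
            follow=True,
        )
    ]

    def parse_item(self, response):
        # The same extraction logic as above would go here
        pass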

It would also be much better to use Scrapy Items.
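A minimal sketch of what that could look like, assuming a hypothetical PageItem with url and text fields (not part of the original answer):

import scrapy


class PageItem(scrapy.Item):
    # Illustrative fields; define whichever data you actually want to keep
    url = scrapy.Field()
    text = scrapy.Field()

and in the spider, yield items instead of writing files directly:

    def parse_item(self, response):
        item = PageItem()
        item['url'] = response.url
        # Join all text nodes in the page body into one string
        item['text'] = ' '.join(response.css('body').xpath('.//text()').extract())
        yield item

With items, Scrapy's built-in feed exports can write the output for you, e.g. scrapy crawl dialpad -o pages.csv.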