How do I determine whether a link in Scrapy is nofollow or dofollow?

Asked: 2018-01-02 04:30:27

Tags: python scrapy

So, here's the problem. I have a Scrapy bot that follows the internal links of a given site and writes each link's URL, status code, and anchor text to a database. But I'm stuck on getting the follow status of the links. Is there a way to get the rel=nofollow/dofollow information? In case anyone wants to see it, here is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# urlToScrape is defined elsewhere; InternallinkItem is the project's Item class
class MySpider(CrawlSpider):
    name = 'spydiiiii'
    start_urls = [urlToScrape]
    rules = (
        Rule(
            LxmlLinkExtractor(
                allow=(urlToScrape),
                deny=(
                    "google.com",
                    "facebook.com",
                    "pinterest.com",
                    "digg.com",
                    "twitter.com",
                    "stumbleupon.com",
                    "linkedin.com"
                ),
                unique=True
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        item = InternallinkItem()

        # URL of the page that linked to this one
        referring_url = response.request.headers.get('Referer').decode('utf-8')
        item["referring_url"] = referring_url

        # anchor text of the link that was followed, whitespace-normalized
        anchor = response.meta.get('link_text')
        item["anchor_text"] = " ".join(anchor.split())

        item["current_url"] = response.url
        item['status'] = response.status

        yield item

Thanks in advance

1 Answer:

Answer 0 (score: 1):

I use LxmlLinkExtractor manually to get Link objects, which carry the nofollow information.
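As a quick sketch of what those Link objects expose (the HTML and URLs below are made up for illustration):

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# build a response from an inline HTML string with one plain and one nofollow link
html = (b'<a href="http://example.com/a">plain link</a>'
        b'<a href="http://example.com/b" rel="nofollow">nofollow link</a>')
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

for link in LxmlLinkExtractor(unique=True).extract_links(response):
    # link.nofollow is True when the anchor carries rel="nofollow"
    print(link.url, repr(link.text), link.nofollow)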

In parse() I get the links from the first page, create an item for each with the 'nofollow' (and other) information, and send a Request for that url (with the item in meta) to get its status and referer.

Each new Request uses parse_item() as its callback, which reads the item back from meta and adds the status.

parse_item() also runs the extractor on the page it receives to get new links, creates new items, and issues Requests again.

import scrapy
from scrapy.http import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class MySpider(scrapy.Spider):

    name = 'myspider'
    #allowed_domains = ['http://quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']
    #start_urls = ['http://127.0.0.1:5000/'] # for Flask example

    extractor = LxmlLinkExtractor(
        allow=('http://quotes.toscrape.com'),
        #allow=('http://127.0.0.1:5000'), # for Flask example
        deny=(
            'google.com',
            'facebook.com',
            'pinterest.com',
            'digg.com',
            'twitter.com',
            'stumbleupon.com',
            'linkedin.com'
        ),
        unique=True,
    )

    def parse(self, response):
        print('parse url:', response.url)

        # use LxmlLinkExtractor manually
        for link in self.extractor.extract_links(response):
            #print('link:', link)
            item = {}
            item['nofollow'] = link.nofollow
            item['anchor_text'] = link.text
            item['current_url'] = link.url
            #item['referring_url'] = response.url
            yield Request(link.url, meta={'item': item}, callback=self.parse_item)

    def parse_item(self, response):
        print('parse_item url:', response.url)

        item = response.meta['item']
        item['referring_url'] = response.request.headers.get('Referer')
        #item['referring_url'] = response.request.url
        item['status'] = response.status
        yield item

        # use LxmlLinkExtractor manually with new links
        for link in self.extractor.extract_links(response):
            #print('link:', link)
            item = {}
            item['nofollow'] = link.nofollow
            item['anchor_text'] = link.text
            item['current_url'] = link.url
            #item['referring_url'] = response.url
            yield Request(link.url, meta={'item': item}, callback=self.parse_item)

# --- run spider without project ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()

Edit:

Because I don't know of any page that uses rel="nofollow", I created some simple code in Flask to test it.

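The Flask snippet itself did not survive here; a hypothetical stand-in serving a couple of pages with and without rel="nofollow" (the routes and port are my own choices) could look like this:

from flask import Flask

app = Flask(__name__)

# hypothetical test pages: one plain link and one rel="nofollow" link
@app.route('/')
def index():
    return ('<a href="/page1">normal link</a> '
            '<a href="/page2" rel="nofollow">nofollow link</a>')

@app.route('/page1')
def page1():
    return '<a href="/" rel="nofollow">back (nofollow)</a>'

@app.route('/page2')
def page2():
    return '<a href="/">back</a>'

if __name__ == '__main__':
    # port 5000 matches the commented-out http://127.0.0.1:5000 URLs in the spider
    app.run(port=5000)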