Question

I'm having trouble understanding how the link pattern works in Scrapy. I was given the example below. Does anyone have any idea how to write one?

def parse(self, response):

    hxs             = scrapy.Selector(response)
    links           = hxs.xpath("//a/@href").extract()        
    # Already-crawled links are stored in this list
    # (note: as a local variable it is reset on every call to parse)
    crawledLinks    = []



    #Pattern to check proper link
    linkPattern     = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")



    for link in links:
        # If it is a proper link and is not checked yet, yield it to the Spider
        if linkPattern.match(link) and link not in crawledLinks:
            crawledLinks.append(link)
            yield Request(link, self.parse)


    item = MS_homeItem() 
    item['name'] = hxs.xpath('//*[@id="product-detail-page"]/li[4]/div/div[2]/h1').extract()
    yield item

Any help would be great. Thanks, James

Answer 1

Python supports a very common feature called regular expressions. Here is a simpler example:

import re

strings = [
    "cat",
    "cut",
    "bird",
    "catnip",
    "cute",
]

pattern = r"c.t"
regex = re.compile(pattern)

for string in strings:
    if regex.match(string):
        result = 'yes'
    else:
        result = 'no'

    print("{} => {}".format(string, result))

--output:--
cat => yes
cut => yes
bird => no
catnip => yes
cute => yes

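In the pattern, . matches any single character (except a newline, by default). Note that match() only checks the start of the string, which is why catnip and cute also count as matches.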
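To make the difference between matching at the start and matching the whole string concrete, here is a small sketch using the same c.t pattern. The extra test strings ("tomcat" and "c\nt") are made up for illustration and are not from the example above.

```python
import re

regex = re.compile(r"c.t")

# match() anchors only at the start of the string
starts_with = regex.match("catnip") is not None      # "cat" is at the start
not_at_start = regex.match("tomcat") is not None     # "cat" is there, but not at the start

# search() looks for the pattern anywhere in the string
found_anywhere = regex.search("tomcat") is not None

# fullmatch() requires the whole string to fit the pattern
whole_catnip = regex.fullmatch("catnip") is not None
whole_cat = regex.fullmatch("cat") is not None

# By default . does not match a newline character
newline_case = regex.match("c\nt") is not None

print(starts_with, not_at_start, found_anywhere)
print(whole_catnip, whole_cat, newline_case)
```

If you want a pattern to accept only complete strings, either use fullmatch() or anchor the pattern with ^ and $, which is what linkPattern does.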

linkPattern is a fairly complex pattern that matches things like:

http://www.google.com

as well as more complicated URLs. URLs can get very complicated, as you can read here.
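One way to see what linkPattern accepts is to write it out piece by piece. The sketch below is my reading of the pattern from the question, rewritten with re.VERBOSE so each part can carry a comment. It assumes the &amp; in the posted pattern was meant to be a plain &. The helper name is_proper_link, the base URL and the sample links are hypothetical and only there for illustration.

```python
import re
from urllib.parse import urljoin

link_pattern = re.compile(r"""
    ^(?:ftp|http|https)://              # scheme, followed by ://
    (?:[\w.\-+]+:?[\w.\-+]*@)?          # optional user:password@
    [a-z0-9\-.]+                        # host name (lowercase letters only)
    (?::[0-9]+)?                        # optional :port
    (?:
        /                               # either a bare trailing slash
      | /[\w#!:.?+=&%@\-/()]+           # or a path (may include a query string)
      | \?[\w#!:.?+=&%@\-/()]+          # or a query string right after the host
    )?$
""", re.VERBOSE)


def is_proper_link(url):
    """Hypothetical helper: True if url is an absolute link the pattern accepts."""
    return link_pattern.match(url) is not None


print(is_proper_link("http://www.google.com"))
print(is_proper_link("https://example.com:8080/a/b?x=1"))
print(is_proper_link("/relative/path"))              # no scheme, so rejected
print(is_proper_link("mailto:someone@example.com"))  # scheme not in the list
print(is_proper_link("http://Example.com"))          # uppercase host is rejected

# href values are often relative, so resolve them against the page URL first
absolute = urljoin("http://example.com/shop/", "item?id=3")
print(absolute, is_proper_link(absolute))
```

Because the pattern only accepts absolute URLs, relative href values are silently skipped unless they are resolved first. Inside a spider, response.urljoin(link) does the same job as urljoin here.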
