Question

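I want to create a scraper that starts at a URL (page 1) and follows a link to a new page, page 2. On page 2 it should follow a link to page 3. Then I want to scrape some data on page 3.

However, I'm a scraping newbie and can't get the callback functions to work. Here is my code: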

class allabolagnewspider(CrawlSpider):
    name = "allabolagnewspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link1'),
    )

    def parse_link1(self, response):
        hxs = HtmlXPathSelector(response)
        return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract(), callback=self.parse_link2)

    def parse_link2(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagnewItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item

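However, when I run it, I get the following error message: "TypeError: Request url must be str or unicode, got list:"

If I understand it correctly, I'm messing up when I try to return my request in parse_link1. What should I do?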

Edit:

Here is the working code (I still have a few issues, but the specific problem has been solved):

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# AllabolagnewItem comes from the project's items module (not shown in the question)


class allabolagnewspider(CrawlSpider):
    name = "allabolagnewspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[2]'),
                           canonicalize=False),
             callback='parse_link1'),
    )

    def parse_link1(self, response):
        # Page 2: yield one request per matching link to page 3
        for href in response.xpath('''//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a/@href''').extract():
            print("hey")
            yield Request(response.urljoin(href), callback=self.parse_link2)

    def parse_link2(self, response):
        # Page 3: extract the data; XPaths are relative to the selected element
        for sel in response.xpath('//*[@id="printContent"]'):
            print("hey2")
            item = AllabolagnewItem()
            item['Byra'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            item['Namn'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            item['Gender'] = sel.xpath('./div[2]/table//tr[7]/td/table[1]//tr[1]/td/text()').extract()
            item['Alder'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            yield item

Answer 1

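In parse_link1, you are passing a list as the value of url, the first argument of the Request constructor, whereas a single value is expected. The list is the result of calling .extract() on a SelectorList (which is what calling .xpath() on the hxs selector returns).

Use .extract_first() instead:

return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract_first(), callback=self.parse_link2)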

Edit, following the OP's comment about this error:

"TypeError: Request url must be str or unicode, got NoneType:"

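This is due to an "overly conservative" XPath expression, probably given to you by your browser's inspect tool, I guess (I tested your XPath in Chrome and it works for this example page).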
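To make the change of error message concrete, here is a minimal sketch of the two extraction styles. The three functions are hypothetical stand-ins written for illustration, not Scrapy's own code: they only mimic the documented behaviour that .extract() always gives a list and .extract_first() gives the first match or None.

```python
def extract(matches):
    """Stand-in for SelectorList.extract(): always a list, possibly empty."""
    return list(matches)


def extract_first(matches, default=None):
    """Stand-in for SelectorList.extract_first(): first match, or the default."""
    for match in matches:
        return match
    return default


def check_url(url):
    """Hypothetical check that mimics the type test behind the reported error."""
    if not isinstance(url, str):
        raise TypeError("Request url must be str, got %s" % type(url).__name__)
    return url


no_match = []                    # the XPath matched nothing
one_match = ["/page3"]           # the XPath matched one href (made-up value)

results = {}
for label, value in [
    ("extract", extract(one_match)),              # a list, even with one match
    ("first_no_match", extract_first(no_match)),  # None
    ("first_match", extract_first(one_match)),    # a plain string
]:
    try:
        results[label] = check_url(value)
    except TypeError as exc:
        results[label] = str(exc)

for label, outcome in results.items():
    print(label, "->", outcome)
```

So "got NoneType" means the switch to .extract_first() worked, but the XPath itself selected nothing in the downloaded HTML.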

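The trouble is with .../table/tbody/tr/.... The thing is that <tbody> is rarely present in real HTML pages written by people, or even in templates (which are written by people too). HTML expects <table> to have a <tbody>, but nobody really cares, and browsers handle it fine (they inject the missing <tbody> element to host the <tr> rows).

So, although it is not strictly an equivalent XPath, it is usually fine to:

omit tbody/ and use the table/tr pattern
or use table//tr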
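The effect can be reproduced offline with a small sketch. It uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) instead of Scrapy, and the HTML snippet is a made-up, well-formed stand-in for a downloaded page whose source has no <tbody>.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source: the table has no <tbody>.
RAW_HTML = """
<html><body>
  <div id="printContent">
    <table>
      <tr><td><a href="/page2">first</a></td></tr>
      <tr><td><a href="/page3">second</a></td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as a browser inspector would give it: it expects a <tbody>
# that only exists in the browser's DOM, not in the source.
with_tbody = root.findall(".//div[@id='printContent']/table/tbody/tr/td/a")

# Same path with tbody/ omitted.
without_tbody = root.findall(".//div[@id='printContent']/table/tr/td/a")

# Descendant step: matches whether or not a <tbody> sits in between.
descendant = root.findall(".//div[@id='printContent']/table//tr/td/a")

hrefs = [a.get("href") for a in descendant]
print(len(with_tbody), len(without_tbody), hrefs)
```

The table//tr form is the more tolerant of the two, since it keeps working on the rare pages that do include <tbody> in their source.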

See it in action using scrapy shell:

$ scrapy shell http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
>>>
>>> # with XPath from browser tool (I assume), you get nothing for the "real" downloaded HTML 
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a')
[]
>>>
>>> # or, omitting `tbody/`
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]

>>> # replacing "/table/tbody/" with "/table//" (tbody is added by browser to have "correct DOM tree")
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>>
>>> # suggestion: use the <img> tag after the <a> as predicate
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>>

Also, you need to:

get the "href" attribute value (add @href at the end of the XPath)
build an absolute URL; response.urljoin() does this

Continuing in scrapy shell:

>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a/@href').extract_first()
u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> response.urljoin(u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
u'http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>>

In the end, your callback could become:

def parse_link1(self, response):
    # .extract() returns a list here, after .xpath()
    # so you can loop, even if you have 1 result
    #
    # XPaths can be multiline, it's easier to read for long expressions
    for href in response.xpath('''
        //*[@id="printContent"]
           /div[2]
            /table//tr[4]/td
             /table//tr/td[2]/a/@href''').extract():
        yield Request(response.urljoin(href),
                      callback=self.parse_link2)

Answer 2

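hxs.xpath(...).extract() returns a list, not a string. Try iterating over the list and yielding a request for each item, or pick the one URL you need from the list.

After that, it will only work if the links in the page are absolute paths. If they are relative, you need to build absolute paths.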
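As a rough sketch of both points, the loop below walks a list of extracted hrefs and resolves each one against the page URL with the standard library's urljoin, which handles relative links in much the same way as Scrapy's response.urljoin(). The hrefs are made-up examples, not links taken from the real site.

```python
from urllib.parse import urljoin

# Hypothetical values: the page that was fetched and the hrefs found on it.
page_url = "http://www.allabolag.se/5565794400/befattningar"
hrefs = [
    "/befattningshavare/example/abc123",           # root-relative
    "verksamhet",                                  # relative to the current path
    "http://www.allabolag.se/5565794400/bokslut",  # already absolute
]

# One absolute URL per extracted href; absolute inputs pass through unchanged.
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

In a spider callback, each resulting URL would then go into its own Request.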

Scrapy callback functions: how to parse several pages?
