Question

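I'm writing a scraper with Scrapy. One of the things I want it to do is compare the root domain of the current page with the root domain of each link on that page. If the two domains differ, it should go ahead and extract the data. Here is my code so far: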

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    def parse(self, response):
        items = []
        for link in response.xpath("//a"):
            #Extract the root domain for the main website from the canonical URL
            hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
            hostname1 = urlparse(hostname1).hostname
            #Extract the root domain for the link
            hostname2 = link.xpath('@href').extract()
            hostname2 = urlparse(hostname2).hostname
            #Compare if the root domain of the website and the root domain of the link are different.
            #If so, extract the items & build the dictionary 
            if hostname1 != hostname2:
                item = SocialMediaItem()
                item['SourceTitle'] = link.xpath('/html/head/title').extract()
                item['TargetTitle'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
        return items

However, when I run it, I get this error:

Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "E:\Usuarios\Daniel\GitHub\SocialMedia-Web-Scraper\socialmedia\socialmedia\spiders\SocialMedia.py", line 16, in parse
    hostname1 = urlparse(hostname1).hostname
  File "C:\Anaconda\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Anaconda\lib\urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
exceptions.TypeError: unhashable type: 'list'

Can anyone help me get rid of this error? I think it has something to do with list keys, but I don't know how to fix it. Thank you very much!

Dani

Answer 1

There are a few things wrong here:

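There is no need to compute hostname1 inside the loop, because it always selects the same link element. That holds even when the expression is evaluated on a sub-selector, because the XPath expression is absolute rather than relative (which is what you need here anyway).
The XPath expression for hostname1 is malformed and matches nothing, hence the error when you try to take the first element as Kevin proposed. It contains two single quotes in a row where it needs either an escaped single quote or a double-quoted string; Python reads the doubled quotes as adjacent string literals and joins them, so the quotes around canonical disappear.
You are selecting the link element itself when you should be selecting its @href attribute. The XPath expression should be changed to reflect this.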
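
To make the quoting problem concrete, here is a minimal sketch in plain Python (no Scrapy needed). The variable names `broken` and `fixed` are only for illustration; the first string is the expression from the question, the second is one possible corrected form.

```python
# Python joins adjacent string literals, so the doubled single quotes
# do not end up inside the string at all.
broken = '/html/head/link[@rel=''canonical'']'
print(broken)  # /html/head/link[@rel=canonical]

# Double-quoting the Python string keeps the single quotes XPath needs,
# and /@href selects the attribute value rather than the element.
fixed = "/html/head/link[@rel='canonical']/@href"
print(fixed)
```

Without the quotes, XPath compares @rel against a child element named canonical instead of the string 'canonical', so nothing matches.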

With these problems fixed, the code would look something like this (untested):

    def parse(self, response):
        items = []
        hostname1 = response.xpath("/html/head/link[@rel='canonical']/@href").extract()[0]
        hostname1 = urlparse(hostname1).hostname

        for link in response.xpath("//a"):
            hostname2 = (link.xpath('@href').extract() or [''])[0]
            hostname2 = urlparse(hostname2).hostname
            #Compare and extract
            if hostname1 != hostname2:
                ...
        return items
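
One thing the comparison above probably does not handle well is relative links: an href such as /wiki/Facebook has no hostname, so urlparse gives None and the link looks "different" from the page. A rough sketch of one way around that, assuming Python 3 (`urllib.parse` instead of the Python 2 `urlparse` module) and a hypothetical helper name `is_external`, is to resolve the href against the page URL first:

```python
from urllib.parse import urljoin, urlparse


def is_external(page_url, href):
    """Return True if href points at a different hostname than page_url.

    Hypothetical helper, not part of Scrapy. Relative hrefs are resolved
    against page_url before the hostnames are compared.
    """
    page_host = urlparse(page_url).hostname
    link_host = urlparse(urljoin(page_url, href)).hostname
    # Links without a hostname (mailto:, javascript:, ...) are not external.
    return link_host is not None and link_host != page_host


page = 'http://en.wikipedia.org/wiki/Social_media'
print(is_external(page, '/wiki/Facebook'))
print(is_external(page, 'http://www.example.com/'))
```

Note that this compares full hostnames, so a subdomain of the same site still counts as different; whether that matches what you mean by "root domain" is up to you.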

Answer 2

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
hostname1 = urlparse(hostname1).hostname

extract returns a list of strings, but urlparse accepts only a single string. Perhaps you should discard all but the first hostname found.

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()[0]
hostname1 = urlparse(hostname1).hostname

Likewise for the other hostname.

hostname2 = link.xpath('@href').extract()[0]
hostname2 = urlparse(hostname2).hostname

If you're not sure the document contains a hostname at all, it may be useful to look before you leap.

hostname1 = link.xpath('/html/head/link[@rel=''canonical'']').extract()
if not hostname1: continue
hostname1 = urlparse(hostname1[0]).hostname

hostname2 = link.xpath('@href').extract()
if not hostname2: continue
hostname2 = urlparse(hostname2[0]).hostname
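
If you find yourself repeating that check, it could be folded into a small helper. This is only a sketch: `first_hostname` is a made-up name, and it assumes Python 3, where urlparse lives in `urllib.parse`. It takes the list that extract() returns and hands urlparse a single string.

```python
from urllib.parse import urlparse


def first_hostname(extracted):
    """Return the hostname of the first URL in a list, or None.

    `extracted` is expected to be a list of strings, such as the result
    of calling extract() on a selector. An empty list yields None.
    """
    if not extracted:
        return None
    # urlparse wants one string, so pass only the first list element.
    return urlparse(extracted[0]).hostname


print(first_hostname(['http://en.wikipedia.org/wiki/Social_media']))
print(first_hostname([]))
```

In the loop you would then write something like `if hostname2 is None: continue` after calling it.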

How do I get rid of the exceptions.TypeError error?