Question

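Yes, I have read the question "converting scrapy to lxml" carefully. But my project has dozens of spiders that use Scrapy selectors, and converting them from Scrapy to lxml line by line costs us a lot of time. So I tried writing some compatibility code to migrate the scrapers.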

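class ElemList(list):
    """A list of lxml elements or strings with a Scrapy-like selector API."""

    def __init__(self, elem_list=None):
        # Avoid a mutable default argument.
        super().__init__(elem_list or [])

    def xpath(self, xpath_str=""):
        res = []
        for elem in self:
            try:
                found = elem.xpath(xpath_str)
            except Exception:
                # Skip items that cannot be queried (e.g. plain strings).
                continue
            if isinstance(found, list):
                # Node sets, text() and attribute queries return lists.
                res.extend(found)
            else:
                # Scalar results, e.g. from string(...) or count(...).
                res.append(found)
        return ElemList(res)

    def extract(self):
        # Keep only string results (text nodes, attribute values).
        return [elem for elem in self if isinstance(elem, str)]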

In the response class, I add some initialization to __init__.

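from lxml import etree

class Response(object):
    def __init__(self):
        # self.html is assumed to hold the page source as a string.
        # Wrap the root element in a list; passing the element directly
        # would fill ElemList with its children instead of the root.
        self.elem_list = ElemList([etree.HTML(self.html)])

    def xpath(self, xpath):
        return self.elem_list.xpath(xpath)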

With this class, I can call the response object like this:

resp.xpath('//h2[@class="user-card-name"]/text()').extract()
resp.xpath('//h2[@class="user-card-name"]').xpath('*[@class="top-badge"]/a/@href').extract()

It works. But now there is a new problem: how can I migrate response.css calls like these?

baseInfo_div = response.css(".vcard")[0]
baseInfo_div.css(".vcard-fullname")
baseInfo_div.css(".vcard-username")
baseInfo_div.css('li[itemprop="worksFor"]')
baseInfo_div.css('li[itemprop="homeLocation"]')

Answer 1

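You can try implementing the logic with the cssselect() method provided by lxml.cssselect, which lets you query lxml Element objects using CSS selector expressions. Alternatively, you can use GenericTranslator.css_to_xpath() to convert a CSS selector into an XPath selector.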
