Question

I'm building a Scrapy spider and need an efficient, correct way to strip a string that contains a URL. The string always starts with [u' and ends with '], for example [u'http://example.com/2334878'].

def parse(self, response):
    for sel in response.xpath("//div[@class='category']/a"):
        item = SpiderItem()
        item['title'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        linkToPost = str(item['link'])
        linkToPost = linkToPost.strip("['u")
        linkToPost = linkToPost.replace("'", "")
        linkToPost = linkToPost.replace("]", "")
        print linkToPost
        #Parse request to follow the posting link into the actual post
        request = scrapy.Request(linkToPost , callback=self.parse_item_page)
        request.meta['item'] = item
        yield request

Answer 1

This happens because extract() returns a list:

extract()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
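To see where the [u' and '] come from, here is a minimal sketch in plain Python. The `links` list is a hand-written stand-in for what `sel.xpath('@href').extract()` returns, not real Scrapy output; calling str() on it renders the whole list rather than the URL inside it.

```python
# Hypothetical stand-in for the result of sel.xpath('@href').extract():
# a list holding one string.
links = ['http://example.com/2334878']

# str() on a list renders the list itself, brackets and quotes included.
# Python 2 additionally prints a u prefix in front of unicode strings.
as_text = str(links)

# Indexing the list gives the URL with nothing to strip.
first_link = links[0]

print(as_text)
print(first_link)
```

So the characters being stripped in the question are the list's printed form, not part of the URL.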

The most idiomatic Scrapy approach here is to use an ItemLoader with the TakeFirst or Join processors.
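As a rough illustration of what those two processors do to an extracted list, here is a plain-Python sketch. The functions `take_first` and `join_values` are hypothetical stand-ins written for this example; they are not Scrapy's own code and only mimic the documented behavior (TakeFirst returns the first non-empty value, Join concatenates the values with a separator that defaults to a single space).

```python
def take_first(values):
    """Stand-in for TakeFirst: return the first value that is not None or ''."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


def join_values(values, separator=' '):
    """Stand-in for Join: concatenate the values with a separator."""
    return separator.join(values)


# Hand-written lists playing the role of extract() results.
link = take_first(['http://example.com/2334878'])
title = join_values(['Some', 'post', 'title'])
nothing = take_first([])

print(link)
print(title)
print(nothing)
```

In a real spider the processors would be attached to the loader or to the item fields, so every field comes out as a single string instead of a list.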

Alternatively, just take the first element of the list:

item['title'] = sel.xpath('text()').extract()[0]
item['link'] = sel.xpath('@href').extract()[0]
