Question

I am trying to scrape some information from a table on a website. My code is as follows:

import csv
import bs4 as bs
import requests
from lxml import html


url = "http://resources.afaqs.com/index.html?id={}&category=AD+Agencies&alpha="
data = []

for number in range(1,100):
    soup = url.format(number)
    r = requests.get(soup)
    tree = html.fromstring(r.content)

    legalname = tree.xpath('//h2[@itemprop="legalname"]/text()')
    ownername = tree.xpath('//td[@itemprop="name"]/text()')
    locality = tree.xpath('//td[@itemprop="addressLocality"]/text()')
    pincode = tree.xpath('//span[@itemprop="postalCode"]/text()')
    addressregion = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    telephone = tree.xpath('//span[@itemprop="telephone"]/text()')
    fax = tree.xpath('//span[@itemprop="faxNumber"]/text()')
    email = tree.xpath('//a[starts-with(@href, "mailto")]/text()')

    legalname = [unicode(i) for i in legalname]
    ownername = [unicode(i) for i in ownername]
    locality = [unicode(i) for i in locality]
    pincode = [unicode(i) for i in pincode]
    addressregion = [unicode(i) for i in addressregion]
    telephone = [unicode(i) for i in telephone]
    fax = [unicode(i) for i in fax]
    email = [unicode(i) for i in email]

    data = {"legalname" : [legalname], "ownername" : [ownername], "locality": [locality], "pincode" : [pincode], "addressregion" : [addressregion],  "email": [email]}
    with open('output.csv','a') as file:
        writer=csv.writer(file)
        writer.writerow(['col1', 'col2'])
        for key in sorted(data.keys()):
            writer.writerow([key]+data[key])

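Whenever this code encounters a value that is not valid UTF-8, it raises an error. I tried converting the text to unicode, but that does not work. I keep getting the following error: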

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 14: invalid start byte
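For context on the traceback above: 0x96 cannot start a UTF-8 sequence, but in Windows-1252 (cp1252) it is the en dash, which suggests the page may not be UTF-8 at all. The following is a minimal sketch of that check. The sample bytes and the helper name `try_decode` are hypothetical and are not taken from the actual page.

```python
# Hypothetical sample bytes containing 0x96; not copied from the real site.
raw = b"Ogilvy \x96 Mumbai"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")      # fails: 0x96 is not a valid start byte
as_cp1252 = try_decode(raw, "cp1252")   # succeeds: 0x96 maps to U+2013 (en dash)

print(repr(as_utf8))
print(repr(as_cp1252))
```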

I tried changing

tree = html.fromstring(r.content)

by adding a parser:

from lxml import etree  # needed for etree.HTMLParser

myparser = etree.HTMLParser(encoding="utf-8")
tree = html.fromstring(r.content, parser=myparser)

How can I convert the xpath text values to utf-8 so that I can extract the data?
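One possible direction, offered only as a sketch: decode the response bytes yourself before parsing, trying UTF-8 first and falling back to cp1252, then write the extracted text out as ordinary strings. This assumes Python 3 and assumes the page is really Windows-1252, which the question does not confirm. The helper names `decode_with_fallback` and `rows_to_csv` are hypothetical.

```python
import csv
import io


def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first encoding that accepts them.

    If every candidate fails, decode as UTF-8 and substitute U+FFFD
    for the offending bytes instead of raising.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def rows_to_csv(rows):
    """Serialize rows of already-decoded strings to CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


# Hypothetical bytes standing in for r.content.
text = decode_with_fallback(b"A \x96 B")
print(rows_to_csv([["legalname", text]]))
```

In the scraper, the decoded text rather than r.content would be what gets parsed, and the output file would be opened with an explicit encoding, for example open('output.csv', 'a', encoding='utf-8', newline='').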

Parsing content with Unicode decode errors using Python

0 answers: