I am trying to scrape some information from a table on a website. My code is as follows:
import csv
import bs4 as bs
import requests
from lxml import html

url = "http://resources.afaqs.com/index.html?id={}&category=AD+Agencies&alpha="
data = []
for number in range(1,100):
    soup = url.format(number)
    r = requests.get(soup)
    tree = html.fromstring(r.content)
    legalname = tree.xpath('//h2[@itemprop="legalname"]/text()')
    ownername = tree.xpath('//td[@itemprop="name"]/text()')
    locality = tree.xpath('//td[@itemprop="addressLocality"]/text()')
    pincode = tree.xpath('//span[@itemprop="postalCode"]/text()')
    addressregion = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    telephone = tree.xpath('//span[@itemprop="telephone"]/text()')
    fax = tree.xpath('//span[@itemprop="faxNumber"]/text()')
    email = tree.xpath('//a[starts-with(@href, "mailto")]/text()')
    legalname = [unicode(i) for i in legalname]
    ownername = [unicode(i) for i in ownername]
    locality = [unicode(i) for i in locality]
    pincode = [unicode(i) for i in pincode]
    addressregion = [unicode(i) for i in addressregion]
    telephone = [unicode(i) for i in telephone]
    fax = [unicode(i) for i in fax]
    email = [unicode(i) for i in email]
    data = {"legalname" : [legalname], "ownername" : [ownername], "locality": [locality], "pincode" : [pincode], "addressregion" : [addressregion], "email": [email]}
    with open('output.csv','a') as file:
        writer=csv.writer(file)
        writer.writerow(['col1', 'col2'])
        for key in sorted(data.keys()):
            writer.writerow([key]+data[key])
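Whenever this code encounters a problematic Unicode value, it raises an error. I tried converting the text to unicode, but that did not work. I keep getting the following error: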
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 14: invalid start byte
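For context, byte 0x96 is not a valid first byte of a UTF-8 character, but in Windows-1252 it is the en dash. The sketch below reproduces the same kind of failure on a made-up byte string; it assumes (this is not confirmed anywhere above) that the page is really encoded as Windows-1252 rather than UTF-8.

```python
# Hypothetical sample: a name containing an en dash, encoded as Windows-1252.
raw = b"Ogilvy \x96 Mather"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x96 has the bit pattern of a UTF-8 continuation byte,
    # so it cannot start a character on its own.
    utf8_ok = False

# Decoding with the (assumed) real encoding succeeds and yields an en dash.
text = raw.decode("cp1252")
```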
I tried changing
tree = html.fromstring(r.content)
to
myparser = etree.HTMLParser(encoding="utf-8")
tree = html.fromstring(r.content, parser=myparser)
How can I convert the xpath text values to UTF-8 so that I can extract the data?
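One possible approach is to decode the response bytes yourself before parsing, and to write the CSV explicitly as UTF-8. This is a minimal sketch, not a confirmed fix: `decode_page` and `write_rows` are hypothetical helper names, the `cp1252` fallback is a guess based on the 0x96 byte in the error, and the code targets Python 3, where `str` is already Unicode and the `unicode(...)` calls are not needed.

```python
import csv
import io


def decode_page(raw, fallback="cp1252"):
    """Decode raw response bytes: try UTF-8 first, then a fallback encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for bytes that are
        # undefined in the fallback encoding instead of raising.
        return raw.decode(fallback, errors="replace")


def write_rows(path, rows):
    """Write rows of text to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself.
    with io.open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)
```

With these helpers, `html.fromstring(decode_page(r.content))` would hand lxml already-decoded text instead of raw bytes. Alternatively, `requests` can do the decoding if you set `r.encoding = "cp1252"` and then read `r.text`; either way, the right encoding for this particular site is an assumption to verify.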