I have a simple script that scrapes some information from a public site and then appends the data to a CSV file:
import requests
import base64
import csv
from lxml import html
from lxml import etree

print (csv.list_dialects())
startUrl = "http://example.com?page="
#max. 964
for i in range (1,20):
    print (i)
    page = requests.get(startUrl+str(i))
    tree = html.fromstring(page.content)
    for element in tree.xpath('//*[@class="std-link std-link--unobtrusive std-link--visitable std-bold"]/@href'):
        subpage = requests.get(element)
        subtree = html.fromstring(subpage.content)
        study = subtree.xpath('//*[@class="std-profileHero__headline"]/h1/text()')
        uni = subtree.xpath('//*[@class="std-headline std-headline--h3"]/a/text()')
        if study:
            study = study[0].replace("\n"," ").replace("\t"," ")
            study = str(study.encode("utf-8")).strip()
        else:
            study = "-"
        if uni:
            uni = uni[0].replace("\n"," ").replace("\t"," ")
            uni = str(uni.encode("utf-8")).strip()
        else:
            uni = "-"
        with open("results.csv", "a", newline="", encoding="utf-8") as csv_file:
            writer = csv.writer(csv_file, delimiter=";")
            writer.writerow([uni, study])
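The script works, but the information stored in the CSV has encoding problems, so I get values like this:
When the CSV is opened in MS Excel 2016, these values are preserved.
As you can see, the script encodes the strings with .encode("utf-8"). I also made sure the CSV file itself is encoded, using encoding="utf-8".
I tried leaving out the encode() call, but then the encoding simply breaks German characters such as ü, ä, and so on.
What am I doing wrong?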
Answer 0 (score: 1)
The odd values appear because you are calling str(b'bytes') rather than str(b'bytes', encoding). Without an encoding argument, str() behaves like repr(b'bytes') and gives you "b'bytes'" instead of "bytes".
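The difference can be seen in a minimal sketch. The sample string and variable names below are made up for illustration and do not come from the scraped site:

```python
# Hypothetical sample value containing German umlauts.
raw = "Universität München".encode("utf-8")  # a bytes object

# Without an encoding, str() returns the same text as repr():
# the b'...' wrapper and the \x escapes become part of the string.
as_repr = str(raw)

# With an encoding, str() decodes the bytes back into real text.
as_text = str(raw, "utf-8")

print(as_repr)
print(as_text)
```

In the question's script, the first form is what ends up in the CSV, which is why the b'...' wrapper shows up in Excel.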
So you should work either entirely with str objects or entirely with bytes objects.
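A rough sketch of the str-only approach follows. The helper names clean_text and append_row, the sample values, and the temporary file path are assumptions made for illustration. The "utf-8-sig" encoding is also an assumption on my part: it writes a byte order mark at the start of a new file, which Excel generally uses to recognise UTF-8, and that may be why the umlauts looked broken when encode() was removed.

```python
import csv
import os
import tempfile


def clean_text(values, default="-"):
    """Return the first XPath text result as a cleaned str, or a default."""
    if not values:
        return default
    # Stay in str the whole time: no .encode(), no str(bytes).
    return values[0].replace("\n", " ").replace("\t", " ").strip()


def append_row(path, row):
    # Assumption: "utf-8-sig" adds a BOM when the file is first created,
    # which helps Excel detect the encoding.
    with open(path, "a", newline="", encoding="utf-8-sig") as csv_file:
        csv.writer(csv_file, delimiter=";").writerow(row)


# Made-up values standing in for the lists returned by subtree.xpath(...).
path = os.path.join(tempfile.mkdtemp(), "results.csv")
append_row(path, [clean_text(["Universität\tMünchen\n"]), clean_text(["Informatik"])])
append_row(path, [clean_text([]), clean_text([])])
```

The file object does the encoding when it writes, so the strings passed to writerow never need to be encoded by hand.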