Question

I have a simple script that scrapes some information from a public site and then appends the data to a CSV file:

import requests
import base64
import csv
from lxml import html
from lxml import etree

print (csv.list_dialects())

startUrl = "http://example.com?page="
#max. 964
for i in range (1,20):
    print (i)
    page = requests.get(startUrl+str(i))
    tree = html.fromstring(page.content)
    for element in tree.xpath('//*[@class="std-link std-link--unobtrusive std-link--visitable std-bold"]/@href'):
            subpage = requests.get(element)
            subtree = html.fromstring(subpage.content)
            study = subtree.xpath('//*[@class="std-profileHero__headline"]/h1/text()')
            uni = subtree.xpath('//*[@class="std-headline std-headline--h3"]/a/text()')
            if study:
                study = study[0].replace("\n"," ").replace("\t"," ")
                study = str(study.encode("utf-8")).strip()
            else:
                study = "-"

            if uni:
                uni = uni[0].replace("\n"," ").replace("\t"," ")
                uni = str(uni.encode("utf-8")).strip()
            else:
                uni = "-"   

            with open("results.csv", "a", newline="", encoding="utf-8") as csv_file:
                writer = csv.writer(csv_file, delimiter=";")
                writer.writerow([uni, study])

The script works, but the information stored in the CSV has encoding problems, so I get values like these:

b'Cat\xc3\xb3lica Lisbon School of Business'
b'Universit\xc3\xa4t Augsburg'
b'Software Engineering'

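When I open the CSV in MS Excel 2016, these values are shown exactly as above.

As you can see, the script encodes the strings with .encode("utf-8"). I also made sure the CSV file itself is encoded: encoding="utf-8".

I tried leaving out the encode() call, but then the German characters such as ü and ä come out broken.

What am I doing wrong?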

Answer 1

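The odd values appear because you are calling str(b'bytes') instead of str(b'bytes', encoding). Without an encoding argument, str() on a bytes object behaves like repr(b'bytes') and gives you "b'bytes'" rather than "bytes".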
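A minimal sketch of the difference, using one of the values from the question as sample input (the variable names are only illustrative):

```python
# Bytes as produced by .encode("utf-8") in the question's script.
raw = "Universität Augsburg".encode("utf-8")

# str() with no encoding does not decode; it returns the repr of the bytes object.
as_repr = str(raw)

# str() with an encoding decodes the bytes back into text.
as_text = str(raw, "utf-8")

print(as_repr)  # the "b'...'" form that ended up in the CSV
print(as_text)  # the readable text
```

Calling raw.decode("utf-8") is equivalent to str(raw, "utf-8").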

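So you should work either entirely with str objects or entirely with bytes objects. Since the file is already opened in text mode with encoding="utf-8", the simplest option is to pass plain str values to the writer and let the file object do the encoding.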
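One way this could look, as a sketch rather than a drop-in replacement. The helpers clean_text and write_rows are hypothetical names, and the file is opened once in "w" mode instead of appending per row. The "utf-8-sig" encoding is an assumption that goes beyond the answer above: it writes a byte order mark at the start of the file, which Excel commonly needs in order to detect UTF-8, and that may be why the umlauts looked broken when encode() was removed.

```python
import csv


def clean_text(values):
    """Return the first XPath text result as a tidy str, or "-" if there is none."""
    if not values:
        return "-"
    return values[0].replace("\n", " ").replace("\t", " ").strip()


def write_rows(path, rows):
    """Write (uni, study) pairs of XPath results to a semicolon-delimited CSV.

    Values stay str the whole way; the file object performs the encoding.
    "utf-8-sig" (an assumption, see above) prepends a BOM for Excel.
    """
    with open(path, "w", newline="", encoding="utf-8-sig") as csv_file:
        writer = csv.writer(csv_file, delimiter=";")
        for uni, study in rows:
            writer.writerow([clean_text(uni), clean_text(study)])
```

In the scraping loop, the two subtree.xpath(...) results would be collected into a list of (uni, study) pairs and passed to write_rows once at the end.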
