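Unable to correctly UTF-8 encode data stored in a CSV

Time: 2018-02-23 10:19:36

Tags: python excel csv

I have a simple script that scrapes some information from a public site and then appends the data to a CSV file: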

import requests
import base64
import csv
from lxml import html
from lxml import etree

print (csv.list_dialects())

startUrl = "http://example.com?page="
#max. 964
for i in range (1,20):
    print (i)
    page = requests.get(startUrl+str(i))
    tree = html.fromstring(page.content)
    for element in tree.xpath('//*[@class="std-link std-link--unobtrusive std-link--visitable std-bold"]/@href'):
            subpage = requests.get(element)
            subtree = html.fromstring(subpage.content)
            study = subtree.xpath('//*[@class="std-profileHero__headline"]/h1/text()')
            uni = subtree.xpath('//*[@class="std-headline std-headline--h3"]/a/text()')
            if study:
                study = study[0].replace("\n"," ").replace("\t"," ")
                study = str(study.encode("utf-8")).strip()
            else:
                study = "-"

            if uni:
                uni = uni[0].replace("\n"," ").replace("\t"," ")
                uni = str(uni.encode("utf-8")).strip()
            else:
                uni = "-"   

            with open("results.csv", "a", newline="", encoding="utf-8") as csv_file:
                writer = csv.writer(csv_file, delimiter=";")
                writer.writerow([uni, study])

The script works, but the information stored in the CSV has encoding problems, so I get values like these:

  • b'Cat\xc3\xb3lica Lisbon School of Business'
  • b'Universit\xc3\xa4t Augsburg'
  • b'Software Engineering'

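These values show up unchanged when the CSV is opened in MS Excel 2016.

As you can see, the script encodes the strings with .encode("utf-8"). I also made sure the CSV file itself is encoded: encoding="utf-8"

I tried it without the encode() function, but then the encoding just breaks German characters such as ü, ä, and so on.

What am I doing wrong?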

1 answer:

Answer 0: (score: 1)

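The strange values appear because you are calling str(b'bytes') instead of str(b'bytes', encoding). Without an encoding argument, str() behaves like repr(b'bytes') and gives you "b'bytes'" rather than "bytes".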
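The difference is easy to reproduce in isolation. A minimal sketch, using one of the values from the question as sample data (the variable names are made up for illustration):

```python
raw = "Universität Augsburg".encode("utf-8")   # a bytes object

as_repr = str(raw)            # no encoding given: same result as repr(raw)
as_text = str(raw, "utf-8")   # decodes the bytes back into text

print(as_repr)   # b'Universit\xc3\xa4t Augsburg'
print(as_text)   # Universität Augsburg
```

The first form is what ended up in the CSV: the text of the bytes literal, including the b'...' wrapper and the escaped bytes.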

So you should work either entirely with str objects or entirely with bytes objects, and not mix the two.
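Applied to the script in the question, staying with str means dropping the .encode("utf-8") calls and passing plain strings to csv.writer; the encoding= argument of open() already handles the encoding when the file is written. The sketch below is only an illustration: the helper clean() and the file name results_demo.csv are hypothetical, and the file is opened once in "w" mode rather than appended to per row. It also uses encoding="utf-8-sig", which writes a byte order mark. That part is an assumption about why the umlauts looked broken in Excel without encode(), since Excel commonly does not recognise a CSV as UTF-8 when the mark is missing.

```python
import csv

def clean(values):
    """Hypothetical helper: first XPath text result as a tidy str, or "-"."""
    if not values:
        return "-"
    return values[0].replace("\n", " ").replace("\t", " ").strip()

# Sample data standing in for the XPath results in the question.
uni = clean(["Universität Augsburg\n"])
study = clean([])

# "utf-8-sig" prepends a BOM so that Excel can detect UTF-8 (assumption:
# this is what the original output was missing).
with open("results_demo.csv", "w", newline="", encoding="utf-8-sig") as csv_file:
    writer = csv.writer(csv_file, delimiter=";")
    writer.writerow([uni, study])   # str values, no manual encode()
```

If the file is consumed by tools other than Excel, plain encoding="utf-8" may be the better choice, since some readers treat the byte order mark as part of the first field.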