Unicode problem when writing to a CSV file

Time: 2016-09-20 17:38:59

Tags: python python-3.x unicode

I need some guidance. I am using the following code:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")
i = 0
schools = []

for school in reqSoup:
    x = reqSoup.find_all("a", {"class" : "school-name"})
    while i < len(x):
        for name in x:
            y = x[i].get_text()
            i += 1
            schools.append(y)

with open('usnwr_schools.csv', 'wb') as f:
    writer = csv.writer(f)
        for y in schools:
        writer.writerow([y])

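My problem is that the em-dashes show up as utf-8 in the resulting CSV file. I have tried a few different things to fix it, but nothing seems to work (including attempting to use regex to strip them out, and trying the .translate method that I found in a StackOverflow question from several years ago).

What am I missing? I would like the CSV output to contain just the text, minus the dashes.

I am using Python 3.5 and am fairly new to Python.

2 answers:

Answer 0 (score: 1)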

To remove the dashes, try y.replace("—","-").replace("–","-") (the first call turns the em dash into a minus, the second turns the en dash into a minus).

If you want only ASCII code points, you can remove everything else using

import string
whitelist=string.printable+string.whitespace
def clean(s):
    return "".join(c for c in s if c in whitelist)

(this gives mostly reasonable results only for plain English text)

By the way, try using

open('usnwr_schools.csv', 'w', newline='', encoding='utf-8') # or whatever encoding you like

because in Python 3 the csv module works with text files, not binary files as it did in Python 2 (you opened the file in binary mode, 'wb').
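
The dash replacement and the ASCII whitelist described in this answer can be tried on a single name before touching the CSV. This is only a minimal sketch: the sample string and the helper name replace_dashes are illustrative and do not come from the question. The scraped names also carry a zero-width space (U+200B) after each dash, which the whitelist filter removes as well.

```python
import string

# string.printable already contains the ASCII whitespace characters,
# so it is enough on its own as the set of characters to keep.
WHITELIST = set(string.printable)

def replace_dashes(s):
    """Turn em dashes and en dashes into a plain hyphen-minus."""
    return s.replace("\u2014", "-").replace("\u2013", "-")

def clean(s):
    """Drop every character that is not printable ASCII."""
    return "".join(c for c in s if c in WHITELIST)

# Illustrative sample: em dash followed by a zero-width space.
sample = "University of Michigan\u2014\u200bAnn Arbor"

ascii_only = clean(replace_dashes(sample))
print(ascii_only)  # University of Michigan-Ann Arbor
```

Calling clean() without replacing the dashes first simply deletes them, which glues the two halves of the name together, so the order of the two steps matters.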

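Answer 1 (score: 0)

Learn to embrace Unicode... the world is not ASCII anymore.

Assuming you are on Windows and viewing the .CSV with Excel or Notepad, use the following line on Python 3. With only this change (and fixing the post's indentation), you can even view the non-ASCII characters correctly. Notepad and Excel like a UTF-8 BOM signature at the start of the file, which utf-8-sig provides.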

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:

If you read the file in another Python script, make sure to open it as follows. Your example of reading b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor' comes from reading the file in binary mode, 'rb'.

with open('usnwr_schools.csv', encoding='utf-8-sig') as f:
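
To see what those escaped bytes actually are, the byte string quoted above can be decoded by hand. A small sketch, using only the standard library, that decodes it as UTF-8 and names the non-ASCII characters it contains:

```python
import unicodedata

# The byte string quoted above, as read in binary mode.
raw = b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor'

# Decoding turns each three-byte UTF-8 sequence into one character.
text = raw.decode('utf-8')

# List the official Unicode names of everything outside ASCII.
non_ascii = [unicodedata.name(c) for c in text if ord(c) > 127]
print(non_ascii)  # ['EM DASH', 'ZERO WIDTH SPACE']
```

So the file content is fine; the bytes only look wrong because nothing decoded them.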

If on Linux, you can use utf8 instead of utf-8-sig.
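
The difference between the two codecs is easy to check on a scratch file. The sketch below is an illustration under assumptions: the temporary path and the two sample rows are made up, not taken from the scraped page. It writes with utf-8-sig, confirms the BOM bytes are there, and shows what happens when the same file is read back with each codec.

```python
import codecs
import csv
import os
import tempfile

# Illustrative rows; the real script would use the scraped names.
rows = [["University of California\u2014Berkeley"], ["Cornell University"]]

# Hypothetical scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# The raw file starts with the three-byte UTF-8 BOM.
with open(path, "rb") as f:
    has_bom = f.read().startswith(codecs.BOM_UTF8)

# utf-8-sig strips the BOM again on reading.
with open(path, newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.reader(f))

# Plain utf-8 leaves the BOM in the first cell as U+FEFF.
with open(path, newline="", encoding="utf-8") as f:
    first_cell = next(csv.reader(f))[0]

print(has_bom)                            # True
print(read_back == rows)                  # True
print(first_cell.startswith("\ufeff"))    # True
```

This is why a file written with utf-8-sig should also be read with utf-8-sig: otherwise the first field picks up an invisible extra character.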

By the way, you can replace your loops with the following code:

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    for school in reqSoup:
        x = reqSoup.find_all("a", {"class" : "school-name"})
        for item in x:
            y = item.get_text()
            writer.writerow([y])

Reading it back:

with open('usnwr_schools.csv',encoding='utf-8-sig') as f:
    print(f.read())

Output:

Massachusetts Institute of Technology
Stanford University
University of California—​Berkeley
California Institute of Technology
Carnegie Mellon University
University of Michigan—​Ann Arbor
Georgia Institute of Technology
University of Illinois—​Urbana-​Champaign
Purdue University—​West Lafayette
University of Texas—​Austin (Cockrell)
Texas A&M University—​College Station (Look)
Cornell University
University of Southern California (Viterbi)
Columbia University (Fu Foundation)
University of California—​Los Angeles (Samueli)
University of California—​San Diego (Jacobs)
Princeton University
Northwestern University (McCormick)
University of Pennsylvania
Johns Hopkins University (Whiting)
Virginia Tech
University of California—​Santa Barbara
Harvard University
University of Maryland—​College Park (Clark)
University of Washington
[the same 25 rows repeat four more times]

If you still want ASCII only, that can be done:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

replacements = {ord('\N{EN DASH}'):'-',
                ord('\N{EM DASH}'):'-',
                ord('\N{ZERO WIDTH SPACE}'):None}

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")

with open('usnwr_schools.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    for school in reqSoup:
        x = reqSoup.find_all("a", {"class" : "school-name"})
        for item in x:
            y = item.get_text()
            writer.writerow([y.translate(replacements)])

with open('usnwr_schools.csv',encoding='ascii') as f:
    print(f.read())
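
The translation table can be tested without fetching the page. A minimal sketch, assuming two sample names typed in by hand (they mimic the scraped data but are not fetched from the site), that applies the same replacements and checks the result really is ASCII:

```python
# Same mapping as above: code point -> replacement, None means delete.
replacements = {
    ord('\N{EN DASH}'): '-',
    ord('\N{EM DASH}'): '-',
    ord('\N{ZERO WIDTH SPACE}'): None,
}

# Hand-written samples with em dashes and zero-width spaces.
names = [
    "University of Illinois\u2014\u200bUrbana-\u200bChampaign",
    "Texas A&M University\u2014\u200bCollege Station (Look)",
]

cleaned = [name.translate(replacements) for name in names]

for name in cleaned:
    # encode() raises UnicodeEncodeError if anything non-ASCII is left.
    name.encode('ascii')
    print(name)

# str.maketrans builds an equivalent table from single characters.
same_table = str.maketrans({'\u2013': '-', '\u2014': '-', '\u200b': None})
```

Opening the output file with encoding='ascii' acts as a safety net: if a name contains some other non-ASCII character that the table does not cover, the write fails with UnicodeEncodeError instead of silently producing a mixed file.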