I need some guidance. I'm using the following code:
import requests
import bs4
import csv
results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')
reqSoup = bs4.BeautifulSoup(results.text, "html.parser")
i = 0
schools = []
for school in reqSoup:
    x = reqSoup.find_all("a", {"class" : "school-name"})
    while i < len(x):
        for name in x:
            y = x[i].get_text()
            i += 1
            schools.append(y)
with open('usnwr_schools.csv', 'wb') as f:
    writer = csv.writer(f)
    for y in schools:
        writer.writerow([y])
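My problem is that the em-dashes show up in the resulting CSV file as UTF-8 byte sequences. I've tried several different ways to fix it, but nothing seems to work (including attempting to use regex to strip them out, and trying a .translate method that I found in a StackOverflow question from several years ago).
What am I missing? I'd like the CSV results to contain only the text, minus the dashes.
I'm using Python 3.5 and am fairly new to Python.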
Answer 0 (score: 1)
To remove the dashes, try
y.replace("—","-").replace("–","-")
(the first call turns an em-dash into a minus sign, the second turns an en-dash into a minus sign)
If you only want ASCII code points, you can remove everything else with
import string
whitelist=string.printable+string.whitespace
def clean(s):
    return "".join(c for c in s if c in whitelist)
(this gives mostly reasonable results only for plain English text)
By the way, try opening the file with
open('usnwr_schools.csv', 'w', newline='', encoding='utf-8') # or whatever encoding you like
because in Python 3 a text file is not a binary file the way it was in Python 2 (you are opening it in binary mode, 'wb').
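As a quick check of the two approaches in this answer, here is a minimal sketch that runs both on a hard-coded sample string. The sample is an assumption standing in for the scraped text (it includes an em-dash followed by a zero-width space), and `replace_dashes` is an illustrative helper name, not something from the original post.

```python
import string

# string.printable already contains the ASCII whitespace characters.
WHITELIST = set(string.printable)

def replace_dashes(s):
    """Map em-dashes and en-dashes to a plain hyphen-minus."""
    return s.replace("\u2014", "-").replace("\u2013", "-")

def clean(s):
    """Drop every character that is not printable ASCII."""
    return "".join(c for c in s if c in WHITELIST)

# Assumed sample: em-dash (U+2014) followed by a zero-width space (U+200B).
sample = "University of Michigan\u2014\u200bAnn Arbor"

replaced = replace_dashes(sample)
print(ascii(replaced))   # the zero-width space is still present here
print(clean(replaced))   # dash kept as '-', zero-width space dropped
print(clean(sample))     # whitelist alone drops the dash entirely
```

Replacing first and filtering second keeps a visible separator between the two parts of the name; filtering alone simply deletes the dash.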
Answer 1 (score: 0)
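Learn to embrace Unicode... the world is no longer ASCII.
Assuming you are on Windows and viewing the .CSV with Excel or Notepad, use the following line on Python 3. With only this change (and with the indentation in your post fixed), you will even see the non-ASCII characters displayed correctly. Notepad and Excel like a UTF-8 BOM signature at the beginning of the file, which utf-8-sig provides.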
with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
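To see what utf-8-sig actually adds, the sketch below writes one row and then inspects the raw bytes. The temporary file path and the sample school name are assumptions made for illustration only.

```python
import csv
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow(["University of California\u2014Berkeley"])

# Read the file back in binary mode to look at the bytes on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw[:3])   # the UTF-8 BOM signature
print(raw[3:])   # the row itself, with the em-dash encoded as three bytes
```

The first three bytes are the signature that Notepad and Excel use to recognise the file as UTF-8; everything after them is ordinary UTF-8.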
If you read the file in another Python script, make sure to open it as shown below. The example contents you quoted, b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor', were read in binary mode ('rb').
with open('usnwr_schools.csv', encoding='utf-8-sig') as f:
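The byte string quoted above is not corrupted; it is simply UTF-8 that has not been decoded yet. A minimal sketch, using only the standard library, that decodes it and names the non-ASCII characters it contains:

```python
import unicodedata

raw = b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor'
text = raw.decode('utf-8')

# Collect the Unicode names of every character outside the ASCII range.
non_ascii = [unicodedata.name(ch) for ch in text if ord(ch) > 127]

for name in non_ascii:
    print(name)
```

This shows that the school name contains an em-dash followed by an invisible zero-width space, which is why replacing the dash alone can still leave a non-ASCII character behind.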
If you are on Linux, you can use utf8 instead of utf-8-sig.
By the way, you can replace your loops with the following:
with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    # find_all searches the whole document, so one call is enough; looping over
    # reqSoup's top-level nodes would write every school once per node.
    for item in reqSoup.find_all("a", {"class" : "school-name"}):
        writer.writerow([item.get_text()])
Reading it back:
with open('usnwr_schools.csv',encoding='utf-8-sig') as f:
    print(f.read())
Output:
Massachusetts Institute of Technology
Stanford University
University of California—Berkeley
California Institute of Technology
Carnegie Mellon University
University of Michigan—Ann Arbor
Georgia Institute of Technology
University of Illinois—Urbana-Champaign
Purdue University—West Lafayette
University of Texas—Austin (Cockrell)
Texas A&M; University—College Station (Look)
Cornell University
University of Southern California (Viterbi)
Columbia University (Fu Foundation)
University of California—Los Angeles (Samueli)
University of California—San Diego (Jacobs)
Princeton University
Northwestern University (McCormick)
University of Pennsylvania
Johns Hopkins University (Whiting)
Virginia Tech
University of California—Santa Barbara
Harvard University
University of Maryland—College Park (Clark)
University of Washington
If you still want ASCII only, this will do it:
import requests
import bs4
import csv
results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')
# Map both dash characters to a hyphen and delete zero-width spaces.
replacements = {ord('\N{EN DASH}'):'-',
                ord('\N{EM DASH}'):'-',
                ord('\N{ZERO WIDTH SPACE}'):None}
reqSoup = bs4.BeautifulSoup(results.text, "html.parser")
with open('usnwr_schools.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    # A single find_all call already covers the whole document.
    for item in reqSoup.find_all("a", {"class" : "school-name"}):
        y = item.get_text()
        writer.writerow([y.translate(replacements)])
with open('usnwr_schools.csv',encoding='ascii') as f:
    print(f.read())
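The translation table can be tried without hitting the network. The sketch below applies it to two hard-coded names: the first mirrors the byte example quoted earlier, and the second is a hypothetical name (not from the scraped page) that contains a character the table does not cover. It then checks whether each result would survive a strict ASCII encode, which is what writing with encoding='ascii' performs.

```python
replacements = {
    ord('\N{EN DASH}'): '-',
    ord('\N{EM DASH}'): '-',
    ord('\N{ZERO WIDTH SPACE}'): None,   # None deletes the character
}

names = [
    'University of Michigan\u2014\u200bAnn Arbor',
    'Universit\u00e9 de Montr\u00e9al',   # hypothetical, for illustration only
]

results = []
for name in names:
    cleaned = name.translate(replacements)
    try:
        cleaned.encode('ascii')          # same strict check the file write applies
        results.append((cleaned, True))
    except UnicodeEncodeError:
        results.append((cleaned, False))

for cleaned, ok in results:
    status = 'ASCII-safe' if ok else 'would raise UnicodeEncodeError on write'
    print(ascii(cleaned), status)
```

The table only handles the characters listed in it, so any other non-ASCII character in the scraped text would still make the ASCII write fail; extending the table or choosing a Unicode encoding are the two ways out.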