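I have some Amazon review data that I successfully converted from text format to CSV. The problem is that when I try to read it into a DataFrame with pandas, I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand the raw review data must contain some non-UTF-8 bytes. How can I remove them and save the result to another CSV file?
Thank you!
EDIT1: Here is the code I use to convert the text to CSV: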
import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME,"w")
outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the review text won't be split into many columns
    line = line.replace(',','')
    if line == "":
        # a blank line ends the current record
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":",1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
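Thanks to everyone who tried to help. I solved it by changing how the output file is opened in my code, so that it is written as UTF-8: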
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
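As a side note, the converter above has to delete every comma because it joins the fields by hand. One possible alternative, sketched here on the assumption that the input keeps the same "key: value" lines separated by blank lines, is to let the standard csv module quote the fields instead. The function name convert_reviews, the use of errors="replace" on the input, and the stripping of whitespace around each value are my own choices for illustration, not part of the original code.

```python
import csv

HEADER = [
    "product/productId", "review/userId", "review/profileName",
    "review/helpfulness", "review/score", "review/time",
    "review/summary", "review/text",
]

def convert_reviews(input_path, output_path):
    """Convert blank-line-separated 'key: value' records into a UTF-8 CSV."""
    # errors="replace" (an assumption) turns undecodable bytes into U+FFFD
    # instead of raising UnicodeDecodeError while reading.
    with open(input_path, encoding="utf-8", errors="replace") as src, \
         open(output_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)  # quotes fields that contain commas
        writer.writerow(HEADER)
        record = []
        for line in src:
            line = line.strip()
            if not line:
                # A blank line ends the current record.
                if record:
                    writer.writerow(record)
                    record = []
                continue
            # Split on the first colon only; the value may contain more colons.
            _, _, value = line.partition(":")
            record.append(value.strip())
        if record:
            writer.writerow(record)
```

Because csv.writer handles the quoting, commas inside the review text survive and still end up in a single column.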
Answer 0 (score: 5)
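If the input file is not UTF-8 encoded, trying to read it as UTF-8 is probably not a good idea...
You basically have two ways to handle decoding errors. The first is to read the file with an encoding that accepts any byte, such as latin9 (ISO-8859-15). The second is to keep UTF-8 and pass an error handler: errors='ignore' silently drops the bytes that are not valid UTF-8, and errors='replace' substitutes a replacement marker for them (usually shown as ?; when decoding, Python inserts U+FFFD).
For example: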
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
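To make the difference between these options concrete, here is a small sketch that decodes the same bytes three ways. The sample bytes are made up for illustration (they are not taken from the asker's file); they contain the byte 0xf8 named in the error message.

```python
# Hypothetical sample: 0xf8 is not a valid UTF-8 start byte.
raw = b"caf\xf8 review"

# Drop the offending byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

# Keep a visible marker (U+FFFD) where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

# latin9 maps every byte to some character, so decoding never fails,
# but the result is only meaningful if the file really is latin9.
as_latin9 = raw.decode("latin9")

for label, text in [("ignore", ignored), ("replace", replaced), ("latin9", as_latin9)]:
    print(label, ascii(text))
```

The same errors= and encoding= arguments can be passed to open(), as in the two lines above.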
Answer 1 (score: -1)
If you are using Python 3, it has built-in support for Unicode content:
f = open('file.csv', encoding="utf-8")
If you still want to strip all non-ASCII data from it, you can read the file as raw bytes and remove everything that is not plain ASCII:
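import re

# Matches everything except printable ASCII, tab, and line breaks.
# Line breaks are kept so the CSV rows stay separate.
NON_ASCII_REGEX = re.compile(r'[^\x09\x0a\x0d\x20-\x7e]')

def remove_unicode(string_data):
    """ (str|bytes) -> str
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        # Bytes outside ASCII are dropped instead of raising an error.
        string_data = string_data.decode('ascii', 'ignore')
    else:
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    return NON_ASCII_REGEX.sub('', string_data)

# Read raw bytes so that invalid UTF-8 cannot raise during the read.
with open('file.csv', 'rb') as csv_file:
    content = remove_unicode(csv_file.read())
# Mode 'w' truncates the file, so no stale bytes are left at the end.
with open('file.csv', 'w', encoding='ascii', newline='') as csv_file:
    csv_file.write(content)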
Now you can read it without any Unicode problems.
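For the literal request in the question (remove the non-UTF-8 bytes and save the result to another CSV file), a minimal sketch is to decode the raw bytes with errors="ignore" and write the result to a new file. The function name strip_invalid_utf8 is hypothetical, and the sketch assumes the file fits in memory. Unlike remove_unicode, it keeps valid non-ASCII characters and only drops bytes that cannot be decoded.

```python
def strip_invalid_utf8(input_path, output_path):
    """Copy a file, dropping any bytes that are not valid UTF-8.

    Returns the number of bytes that were dropped.
    """
    with open(input_path, "rb") as src:
        raw = src.read()
    # Decoding with errors="ignore" discards invalid sequences;
    # re-encoding yields bytes that are guaranteed to be valid UTF-8.
    cleaned = raw.decode("utf-8", errors="ignore").encode("utf-8")
    with open(output_path, "wb") as dst:
        dst.write(cleaned)
    return len(raw) - len(cleaned)
```

The cleaned copy should then load with pandas.read_csv(..., encoding="utf-8") without the decode error, although any characters that were dropped are lost for good.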