Question

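I have some Amazon review data that I have successfully converted from text format to CSV. The problem is that when I try to read it into a DataFrame with pandas, I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte

I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?

Thank you!

EDIT1: Here is the code I used to convert the text to CSV: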

import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")

outfile = open(OUTPUT_FILE_NAME,"w")

outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:

   line = line.strip()  
   # need to remove the "," so that the review text won't be split across many columns
   line = line.replace(',','')

   if line == "":
      outfile.write(",".join(currentLine))
      outfile.write("\n")
      currentLine = []
      continue
   parts = line.split(":",1)
   currentLine.append(parts[1])

if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()

EDIT2:

Thanks to everyone who tried to help. I solved it by changing how the output file is opened in my code:

 outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
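
A likely explanation for why this works: when `open(..., "w")` is called without `encoding`, Python falls back to the platform's locale encoding. Assuming that was cp1252 on the asker's machine (the question does not say), a character such as "ø" is written as the single byte 0xf8, which is not a valid UTF-8 start byte, so a reader that expects UTF-8 fails. A minimal in-memory sketch of that mismatch, using a made-up sample value:

```python
text = "Søren"  # hypothetical sample value containing a non-ASCII character

# What a cp1252-encoded file would contain: "ø" becomes the single byte 0xf8
legacy_bytes = text.encode("cp1252")

try:
    legacy_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # Same kind of error as in the question
    error_message = str(exc)

# Encoding as UTF-8 instead round-trips cleanly
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(error_message)
print(round_trip == text)
```

Passing `encoding="utf-8"` explicitly on both the writing and the reading side avoids depending on the platform default.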

Answer 1

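If the input file is not UTF-8 encoded, trying to read it as UTF-8 is probably not a good idea...

You basically have two ways to deal with decoding errors:

Use a charset that accepts any byte, such as iso-8859-15, also known as latin9
If the data should be UTF-8 but contains errors, use errors='ignore' -> silently drops the non-UTF-8 bytes, or errors='replace' -> substitutes each one with a replacement marker (U+FFFD when decoding, ? when encoding)

For example: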

f = open(INPUT_FILE_NAME,encoding="latin9")

or

f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
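
To tie this back to the question (saving a cleaned copy to another CSV file), here is a minimal sketch, assuming the input is mostly UTF-8 with a few stray bytes. The function name `clean_to_utf8` and its parameters are hypothetical, chosen only for illustration.

```python
def clean_to_utf8(src_path, dst_path, errors="replace"):
    """Copy src_path to dst_path, producing valid UTF-8.

    errors="replace" turns each undecodable byte into U+FFFD;
    errors="ignore" drops it silently.
    """
    # newline="" on both sides leaves line endings untouched, which
    # matters for CSV files that contain quoted multi-line fields.
    with open(src_path, encoding="utf-8", errors=errors, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Calling `clean_to_utf8("small-movies1.csv", "small-movies1-clean.csv", errors="ignore")` would write a second file (the output name here is made up) that can be read as UTF-8, at the cost of losing whatever the dropped bytes represented.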

Answer 2

If you are using Python 3, it has built-in support for Unicode content:

f = open('file.csv', encoding="utf-8")

If you still want to strip all non-ASCII data from it, you can read it as a plain text file and remove the non-ASCII content:

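import re

def remove_unicode(string_data):
    """ (str|bytes) -> str

    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data

    if isinstance(string_data, bytes):
        # bytes: decode directly, dropping anything outside ASCII
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # str: round-trip through ASCII to drop non-ASCII characters
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')

    # strip everything outside printable ASCII, but keep line breaks
    # so the CSV rows stay separate
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e\r\n]')

    return remove_ctrl_chars_regex.sub('', string_data)

# errors='ignore' drops bytes that are not valid UTF-8 instead of raising
with open('file.csv', 'r+', encoding="utf-8", errors='ignore') as csv_file:
     content = remove_unicode(csv_file.read())
     csv_file.seek(0)
     csv_file.write(content)
     csv_file.truncate()  # discard leftover bytes, since the cleaned text is shorter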

Now you can read it without running into any Unicode data problems.

How to remove non-UTF-8 characters and save as a CSV file in Python
