My CSV contains a string like this:
TîezÑnmidnan
I'm trying to set up the reader/writer with the following:
import csv
# File that will be written to
csv_output_file = open(file, 'w', encoding='utf-8')
# File that will be read in
csv_file = open(filename, encoding='utf-8', errors='ignore')
# Define reader
csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
# Define writer
csv_writer = csv.writer(csv_output_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
Then I iterate over the rows that were read in:
# Iterate over the rows in the csv
for idx, row in enumerate(csv_reader):
    csv_writer.writerow(row[0:30])
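The problem is in my output file: I can't get the same string to show up there. According to my Mac, the encoding of the CSV file is "Non-ISO extended-ASCII".
I've tried various encodings; some strip the special characters, and others don't work at all.
This is strange, because I can hard-code the string above into a variable and use it without any problem, so I think it has to do with how I'm reading the file. If I set a breakpoint just before the write and inspect the value in the debugger, it displays as follows.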
T�ez�nmidnan
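I can't convert the file before running the script, so the Python code has to handle all of the conversion itself.
The expected output is for the string to be preserved in the output file exactly as it appears in the input: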
TîezÑnmidnan
I'm adding a link to a sample CSV that shows the problem, along with the full version of my code (with some details removed):
import tkinter as tk
from tkinter.filedialog import askopenfilename
import csv
import os
root = tk.Tk()
root.withdraw()
# Ask for file
filename = os.path.abspath(askopenfilename(initialdir="/", title="Select csv file", filetypes=(("CSV Files", "*.csv"),)))
# Set output file name
output_name = filename.rsplit('.')
del output_name[len(output_name) - 1]
output_name = "".join(output_name)
output_name += "_processed.csv"
# File that will be written to
csv_output_file = open(os.path.abspath(output_name), 'w', encoding='utf-8')
# File that will be read in
csv_file = open(filename, encoding='utf-8', errors='ignore')
# Define reader with , delimiter
csv_reader = csv.reader(csv_file, delimiter=',', quotechar='"')
# Define writer to put quotes around input values with a comma in them
csv_writer = csv.writer(csv_output_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
header_row = []
# Iterate over the rows in the csv
for idx, row in enumerate(csv_reader):
    if idx != 0:
        csv_writer.writerow(row)
    else:
        header_row = row
        csv_writer.writerow(header_row)
csv_file.flush()
csv_output_file.flush()
csv_file.close()
csv_output_file.close()
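For comparison, here is a minimal sketch of the same copy step using `with` blocks and `newline=''` (which the csv module documentation recommends for files passed to `csv.reader` and `csv.writer`). The helper name `copy_csv`, the sample file names, and the choice of `cp1252` as the source encoding are all assumptions for illustration; the real encoding of the input file has not been confirmed.

```python
import csv
import os
import tempfile


def copy_csv(src_path, dst_path, src_encoding):
    """Copy every row from src_path to dst_path, writing the output as UTF-8.

    copy_csv is a hypothetical helper; src_encoding must be the encoding
    the source file was actually saved in.
    """
    # newline='' lets the csv module handle line endings itself.
    with open(src_path, newline='', encoding=src_encoding) as src, \
            open(dst_path, 'w', newline='', encoding='utf-8') as dst:
        reader = csv.reader(src, delimiter=',', quotechar='"')
        writer = csv.writer(dst, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerows(reader)


# Hypothetical sample data, saved as cp1252 bytes (an assumed encoding).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'sample.csv')
    dst_path = os.path.join(tmp, 'sample_processed.csv')
    with open(src_path, 'wb') as f:
        f.write('Header1,Header2\r\nValue1,TîezÑnmidnan\r\n'.encode('cp1252'))

    copy_csv(src_path, dst_path, src_encoding='cp1252')

    # Read the output back as UTF-8 to check that the accents survived.
    with open(dst_path, newline='', encoding='utf-8') as f:
        result = f.read()

print(result)
```

Because no `errors='ignore'` is passed here, a wrong `src_encoding` would either raise `UnicodeDecodeError` or produce visibly wrong characters rather than silently dropping them.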
Expected result
Header1,Header2
Value1,TîezÑnmidnan
Actual result
Header1,Header2
Value1,Teznmidnan
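The dropped characters are consistent with bytes that are not valid UTF-8 being discarded by `errors='ignore'`. The following is only a sketch of that effect: it assumes, purely for illustration, that the file stores the string as Latin-1 bytes, which has not been verified for the real file.

```python
text = 'TîezÑnmidnan'

# Assumption for illustration: the file holds Latin-1 bytes, not UTF-8.
raw = text.encode('latin-1')

# errors='ignore' silently drops every byte that is not valid UTF-8.
ignored = raw.decode('utf-8', errors='ignore')

# errors='replace' substitutes U+FFFD for each invalid byte instead.
replaced = raw.decode('utf-8', errors='replace')

# Decoding with the encoding the bytes were written in restores the text.
recovered = raw.decode('latin-1')

print(ignored)    # Teznmidnan
print(replaced)   # T�ez�nmidnan
print(recovered)  # TîezÑnmidnan
```

If the real file behaves the same way, opening it with its actual encoding (and without `errors='ignore'`) should keep the accented characters.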
Edit: chardetect reports "utf-8 with confidence 0.99".
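Since a detector's answer is a statistical guess, one way to cross-check it is to attempt a strict decode of the raw bytes with a few candidate encodings. The sketch below assumes a hypothetical helper name, `first_strict_decode`, and an arbitrary candidate list; neither comes from the original code.

```python
def first_strict_decode(raw, candidates):
    """Return the first encoding in candidates that decodes raw without errors.

    Hypothetical helper: a successful strict decode only proves the bytes
    are valid in that encoding, not that it is the one the author used.
    """
    for name in candidates:
        try:
            raw.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes as they would look if the string were saved as cp1252 (an assumption).
sample = 'TîezÑnmidnan'.encode('cp1252')
print(first_strict_decode(sample, ['utf-8', 'cp1252']))
```

The order of the candidates matters: an encoding such as `latin-1` accepts any byte sequence, so it would need to come last if it were included.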