I have CSV data like this:
1, 2, 3, bla bla bla, 4, 5;
"1, 2, 3, ""bla, bla, bla"", 4, 5";
"6, 7, 8, ""more, bla, bla"", 9, 10";
6, 7, 8, more bla bla, 9, 10;
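Essentially: one column holds a string that contains the delimiter. In some lines that string is wrapped in doubled double quotes, and the whole line is wrapped in quotes as well.
I tried reading it with pandas: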
df = pd.read_csv("data.csv", sep=',', skipinitialspace=True, quotechar='"', doublequote=True)
But because some lines are quoted as a whole, pandas puts each of those lines entirely into the first column:
column1 column12 column13 column14 column15 column16
1 2 3 bla bla bla 4 5
1,2,3,"bla, bla, bla", 4, 5 nan nan nan nan nan
6,7,8,"more, bla, bla",9,10 nan nan nan nan nan
6 7 8 more bla bla 9 10
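How can I get these quoted lines parsed into separate columns like the others?
Answer 0: (score: 2)
One approach is to preprocess the file before loading it into Pandas: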
import csv
import io
import pandas as pd
data = []
with open('input.csv') as f_input:
    for line in f_input:
        # Strip the outer quotes, the trailing ';' and the newline, then undouble the inner quotes
        line = line.strip('";\n').replace('""', '"')
        # Parse the cleaned line as a single CSV record
        row = next(csv.reader(io.StringIO(line, newline=''), skipinitialspace=True))
        data.append(row)
df = pd.DataFrame(data)
print(df)
This gives:
   0  1  2               3  4   5
0  1  2  3     bla bla bla  4   5
1  1  2  3   bla, bla, bla  4   5
2  6  7  8  more, bla, bla  9  10
3  6  7  8    more bla bla  9  10
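As a variation on the loop above, the `csv` module can be applied twice instead of stripping and replacing quotes by hand. This is only a sketch: `parse_line` is a hypothetical helper name, and it assumes that a genuine record always has more than one column, so a result with exactly one field is treated as a fully quoted line.

```python
import csv

def parse_line(line):
    """Hypothetical helper: return the fields of one raw line."""
    # Drop the newline and the trailing ';' that ends every record.
    line = line.strip().rstrip(';')
    # First pass: a fully quoted line comes back as a single field.
    fields = next(csv.reader([line], skipinitialspace=True))
    if len(fields) == 1:
        # Second pass: parse that single field as a record of its own.
        # Assumption: real records never consist of exactly one column.
        fields = next(csv.reader([fields[0]], skipinitialspace=True))
    return fields

# Two lines taken from the sample data in the question.
sample = [
    '1, 2, 3, bla bla bla, 4, 5;\n',
    '"1, 2, 3, ""bla, bla, bla"", 4, 5";\n',
]
rows = [parse_line(line) for line in sample]
print(rows)
```

Both sample lines come back as six fields, and the list of rows can be passed to `pd.DataFrame` in the same way as `data`.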
Alternatively, you can write out the fixed version for later use:
with open('output.csv', 'w', newline='') as f_output:
    csv.writer(f_output).writerows(data)
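To illustrate what the written file should contain, here is a small sketch that writes two of the cleaned rows to an in-memory buffer. The `io.StringIO` buffer stands in for `output.csv` and the rows are hard-coded from the output shown earlier; both are assumptions made so the example is self-contained.

```python
import csv
import io

# Two cleaned rows, copied from the DataFrame output above.
rows = [
    ['1', '2', '3', 'bla bla bla', '4', '5'],
    ['1', '2', '3', 'bla, bla, bla', '4', '5'],
]

# In-memory stand-in for output.csv.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)

# With default settings the writer quotes only fields that contain
# the delimiter, so the whole-line quoting is gone.
print(buffer.getvalue())
```

Only the field containing commas is quoted in the result, so a plain `pd.read_csv('output.csv', header=None)` should then split every line into six columns.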