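I am trying to create a DataFrame from the sample CSV below, but I get the error "Error tokenizing data. C error: EOF inside string starting at line 0". I don't have much practice handling bad lines, but I really want to learn the best way to deal with this kind of thing. I tried several options in read_csv, such as error_bad_lines=False, but none of them worked.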
CParserError: Error tokenizing data. C error: EOF inside string starting at line 0
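My guess is that the ," at the end of each line is causing the problem, and that the best approach is to loop over the lines and process each one. With some help I came up with the generator below, and I hope I'm close. I'd really like to learn how to use generators and yield.
Sample data: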
"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","
"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","
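I wrote the following, which strips the trailing ," (along with the newline) from each line, but I'm not sure how to build a DataFrame from the resulting strings: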
def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            yield line[:-3]

gen = big_table_generator('../data/test_sun_file.csv')
pd.DataFrame(gen)
Answer 0: (score: 3)
I had a similar error. I fixed it by using the quoting=csv.QUOTE_NONE option in read_csv.
For example:
df = pd.read_csv(csvfile, header = None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
There is some information about the cause in the second comment here: https://github.com/pydata/pandas/issues/5500
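Applied to comma-separated data like the sample in the question, the same option might look like the sketch below. The rows are shortened stand-ins in the same shape as the sample (every field quoted, each line ending in a dangling ,"), and dropping the last column and stripping the quotes afterwards is my own assumption about the cleanup wanted, not something read_csv does for you.

```python
import csv
import io

import pandas as pd

# Two shortened rows in the same shape as the sample data: every field is
# quoted and each line ends with a dangling ,"
raw = (
    '"USNC3255","27","US","NC","LANDS END","","34.65266","\n'
    '"USNC3256","27","US","NC","LANDSDOWN","sc007","35.29374","\n'
)

# QUOTE_NONE makes the parser treat " as an ordinary character, so the
# unmatched quote at the end of each line no longer opens a string.
df = pd.read_csv(io.StringIO(raw), header=None, dtype=str,
                 quoting=csv.QUOTE_NONE)

# The last column holds only the stray quote; drop it, then strip the
# literal quotes that are now part of every remaining value.
df = df.iloc[:, :-1].apply(lambda col: col.str.strip('"'))
print(df)
```

Because the quotes are no longer interpreted, this only works when no field contains the delimiter itself; a quoted value with an embedded comma would be split in two.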
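Answer 1: (score: 0)
Here is the solution I came up with. I'd really like to avoid the list and append and take advantage of a generator instead, but I'm not comfortable enough with generators yet.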
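import csv

import numpy as np
import pandas as pd

def parse_file(filename):
    newline = []
    # Python 3's csv module expects a text-mode file opened with newline=''
    with open(filename, 'r', newline='') as f:
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            # Drop the dangling last field and strip the literal quotes
            newline.append([s.strip('"') for s in row[:-1]])
    df = pd.DataFrame(newline)
    # Turn empty strings into NaN (applymap is named DataFrame.map in pandas 2.1+)
    df = df.applymap(lambda x: np.nan if len(x) == 0 else x).astype(object)
    return df

df = parse_file(filename)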
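For what it's worth, the list-and-append loop above could probably be turned into a generator along these lines. This is only a sketch: iter_rows is a hypothetical helper name, and the rows are shortened stand-ins for the sample data, fed through io.StringIO so the snippet runs without a file.

```python
import csv
import io

import pandas as pd


def iter_rows(lines):
    """Yield one cleaned list of fields per input line (hypothetical helper)."""
    reader = csv.reader(lines, quoting=csv.QUOTE_NONE)
    for row in reader:
        # Drop the dangling last field and strip the literal quote characters.
        yield [field.strip('"') for field in row[:-1]]


# Shortened rows in the same shape as the sample data; a real call would
# pass a file object opened with open(filename, newline='') instead.
sample = io.StringIO(
    '"USNC3255","27","US","NC","LANDS END","","34.65266","\n'
    '"USNC3256","27","US","NC","LANDSDOWN","sc007","35.29374","\n'
)

# The generator yields lists of fields, so each one becomes a DataFrame row.
df = pd.DataFrame(iter_rows(sample))
print(df)
```

The difference from the generator in the question is that this one yields a list of fields per line rather than the raw string, which is what lets pd.DataFrame spread the values across columns. pandas still has to read every row to build the frame, so the generator mainly tidies up the code rather than saving memory.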