Question

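The official TensorFlow tutorial recommends parsing CSV files by reading them line by line with tf.TextLineReader and then applying tf.decode_csv (source). This does not work for CSV records that contain strings with embedded line breaks, because the reader splits a single CSV record across several lines.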

What is the best way to parse files of this kind?

Answer 1

pandas.read_csv() parses such CSV files correctly:

CSV:

a,b,c
1,"text which includes
line
breaks",100
2,another line,200
3,yet another line,300

import pandas as pd

df = pd.read_csv(r'D:\temp\1.csv')

Result:

In [21]: df
Out[21]:
   a                                      b    c
0  1  text which includes\r\nline\r\nbreaks  100
1  2                           another line  200
2  3                       yet another line  300
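
If the data still has to pass through a line-based reader afterwards, one option is to rewrite the file so that every record occupies exactly one line. The following is a minimal sketch that uses Python's standard csv module rather than pandas. The function name flatten_newlines and its placeholder argument are hypothetical names introduced for illustration, and the sketch assumes that replacing embedded line breaks with a space is acceptable for your data.

```python
import csv
import io


def flatten_newlines(src, dst, placeholder=" "):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    flatten_newlines is a hypothetical helper, not part of any library.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", placeholder)
                 .replace("\n", placeholder)
                 .replace("\r", placeholder)
            for field in row
        ]
        writer.writerow(cleaned)


# In-memory stand-in for the sample file shown above.
raw = (
    'a,b,c\r\n'
    '1,"text which includes\r\nline\r\nbreaks",100\r\n'
    '2,another line,200\r\n'
    '3,yet another line,300\r\n'
)

src = io.StringIO(raw, newline="")
dst = io.StringIO()
flatten_newlines(src, dst)

# Each record now sits on exactly one physical line.
flat_lines = dst.getvalue().splitlines()
for line in flat_lines:
    print(line)
```

With real files, open both with newline="" as the csv module documentation recommends, then point the line-based reader at the rewritten file.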

Answer 2

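tf.decode_csv expects CSV files in RFC 4180 format, and according to RFC 4180, line breaks (CRLF) do delimit records. Note, however, that the RFC also allows line breaks inside fields that are enclosed in double quotes.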
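
To illustrate the difference between physical lines and CSV records, here is a small sketch that uses only Python's standard library and no TensorFlow. It assumes the sample data from the first answer with CRLF line endings. Splitting on line breaks yields more pieces than there are records, while a parser that honours quoted fields recovers the records intact.

```python
import csv
import io

# Sample data from the first answer, assumed to use CRLF line endings.
data = (
    'a,b,c\r\n'
    '1,"text which includes\r\nline\r\nbreaks",100\r\n'
    '2,another line,200\r\n'
    '3,yet another line,300\r\n'
)

# Naive approach: treat every physical line as one record.
physical_lines = data.splitlines()

# Quote-aware approach: line breaks inside double quotes stay in the field.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines))  # 6
print(len(records))         # 4
print(records[1])
```

The second record comes back as a single row of three fields, with the line breaks preserved inside the middle field.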

TensorFlow 1.8 introduced the tf.contrib.data.make_csv_dataset API for reading CSV files into a dataset. I don't know whether it solves your problem, but it is worth a try.

How do I parse strings containing line breaks from a CSV file in TensorFlow?
