Handling extra newlines (carriage returns) in csv files parsed with Python?

Asked: 2012-06-21 20:53:15

Tags: python csv newline

I have a CSV file in which some fields contain newlines, for example:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

(In this case, the third field of the second row is "tree\n".)

I tried the following:

import csv
catalog = csv.reader(open('test.csv', 'rU'), delimiter=",", dialect=csv.excel_tab)
for row in catalog:
    print "Length: ", len(row), row

The result I get is:

Length:  6 ['A', ' B', ' C', ' D', ' E', ' F']
Length:  3 ['123', ' 456', ' tree']
Length:  4 ['   ', ' very', ' bla', ' indigo']

Does anyone know how I can quickly remove the extra newline?

Thanks!

6 Answers:

Answer 0: (score: 17)

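Suppose you have this Excel spreadsheet:

[image: Common 'gotchas' in an Excel file]

Note:

  1. the multi-line cell in C2;
  2. the embedded commas in C1 and D3;
  3. the blank cells, and the cell with a space in D4.

Save that as CSV in Excel and you will get this csv file: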

    A1,B1,"C1,+comma",D1
    ,B2,"line 1
    line 2",D2
    ,,C3,"D3,+comma"
    ,,,D4 space
    

Presumably you will want to read that into Python with the blank cells still meaningful and the embedded commas handled correctly.

So, this:

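    import csv

    with open("test.csv", 'rU') as csvIN:
        # dialect='excel' understands quoted fields, so commas and
        # newlines inside quotes stay within a single field
        outCSV = csv.reader(csvIN, dialect='excel')

        for row in outCSV:
            print("Length: ", len(row), row)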
    

correctly produces the 4x4 list-of-lists matrix represented in Excel:

    Length:  4 ['A1', 'B1', 'C1,+comma', 'D1']
    Length:  4 ['', 'B2', 'line 1\nline 2', 'D2']
    Length:  4 ['', '', 'C3', 'D3,+comma']
    Length:  4 ['', '', '', 'D4 space']
    

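The sample CSV file you posted has no quotes around the field containing the "extra newline", so the meaning of that newline is ambiguous. Is it the start of a new row, or part of a multi-line field?

Therefore, you can only interpret this csv file: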

    A, B, C, D, E, F
    123, 456, tree
    , very, bla, indigo
    

as a one-dimensional list, like so:

    with open("test.csv", 'rU') as csvIN:
       outCSV=[field.strip() for row in csv.reader(csvIN, delimiter=',') 
                  for field in row if field]
    

which produces this one-dimensional list:

    ['A', 'B', 'C', 'D', 'E', 'F', '123', '456', 'tree', 'very', 'bla', 'indigo']
    

That list can then be interpreted and regrouped into whatever sub-grouping you need.

The idiomatic way to regroup in Python uses zip, like so:

    >>> zip(*[iter(outCSV)]*6)
    [('A', 'B', 'C', 'D', 'E', 'F'), ('123', '456', 'tree', 'very', 'bla', 'indigo')]
    

Or, if you want a list of lists, this is also idiomatic:

    >>> [outCSV[i:i+6] for i in range(0, len(outCSV),6)]
    [['A', 'B', 'C', 'D', 'E', 'F'], ['123', '456', 'tree', 'very', 'bla', 'indigo']]
    

If you can change how your CSV file is created, it will be less ambiguous to interpret.
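As a rough illustration of what "less ambiguous" means here, the following Python 3 sketch writes a field that ends with a newline using csv.writer and reads it back. The sample rows and the in-memory buffer are assumptions made for the demonstration, not part of the original question. With the default 'excel' dialect the writer quotes the field that contains the newline, so the reader can tell it apart from a row break.

```python
import csv
import io

rows = [
    ["A", "B", "C"],
    ["123", "456", "tree\n"],   # third field ends with a newline
    ["very", "bla", "indigo"],
]

# In-memory stand-in for a file; newline="" leaves line endings to the csv module.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)     # default dialect: 'excel', minimal quoting
writer.writerows(rows)

text = buffer.getvalue()
print(text)                     # the "tree\n" field is written inside quotes

# Reading the text back gives the same three rows, newline included.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed)
```

Because the newline sits inside quotes, no column count or guessing is needed to recover the original rows.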

Answer 1: (score: 6)

If none of your cells are blank, this will work:

from itertools import chain

data = [['A', ' B', ' C', ' D', ' E', ' F'],
['123', ' 456', ' tree'],
['   ', ' very', ' bla', ' indigo']]

flat_list = chain.from_iterable(data)
flat_list = [cell for cell in flat_list if cell.strip() != ''] # remove blank cells

rows = [flat_list[i:i+6] for i in range(0, len(flat_list), 6)] # chunk into groups of 6
print(rows)

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree', ' very', ' bla', ' indigo']]

If you do have blank cells in your input, this will work most of the time:

data = [['A', ' B', ' C', ' D', ' E', ' F'],
['123', ' 456', ' tree'],
['   ', ' very', ' bla', ' indigo']]

clean_rows = []
saved_row = []

for row in data:
    if len(saved_row):
        row_tail = saved_row.pop()
        row[0] = row_tail + row[0]  # reconstitute field broken by newline
        row = saved_row + row       # and reassemble the row (possibly only partially)
    if len(row) >= 6:
        clean_rows.append(row)
        saved_row = []
    else:
        saved_row = row


print clean_rows 

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree   ', ' very', ' bla', ' indigo']]

However, even the second solution will fail on input such as:

A,B,C,D,E,F\nG
1,2,3,4,5,6

In this case the input is ambiguous, and no algorithm can guess whether you meant:

A,B,C,D,E,F
G\n1,2,3,4,5,6 

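(or the input above)

If this is the case for you, you will have to go back to whoever saved the data and get them to save it in a cleaner format (btw, OpenOffice quotes newlines in CSV files far better than Excel does).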
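To make the ambiguity concrete, here is a small Python 3 sketch that wraps the second approach in a function and feeds it the three physical lines "A,B,C,D,E,F", "G" and "1,2,3,4,5,6". The function name merge_broken_rows and the pre-split sample lines are assumptions introduced only for this illustration.

```python
def merge_broken_rows(rows, ncols):
    """Rejoin rows that were split by a stray newline (second approach)."""
    clean_rows, saved = [], []
    for row in rows:
        row = list(row)  # copy so the caller's lists are not modified
        if saved:
            # glue the dangling last field onto the first field of this row
            row[0] = saved.pop() + row[0]
            row = saved + row
        if len(row) >= ncols:
            clean_rows.append(row)
            saved = []
        else:
            saved = row
    return clean_rows

# The three physical lines, already split on commas.
lines = [list("ABCDEF"), ["G"], list("123456")]
result = merge_broken_rows(lines, 6)
print(result)
# [['A', 'B', 'C', 'D', 'E', 'F'], ['G1', '2', '3', '4', '5', '6']]
```

The heuristic silently attaches G to the following row. It has no way of knowing whether G was really the tail of the F field, which is why the fix has to happen where the file is produced.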

Answer 2: (score: 1)

This should work. (Warning: brain-compiled code)

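with open('test.csv', 'rU') as infile:
    data = []
    for line in infile:
        temp_data = line.rstrip('\n').split(',')
        try:
            while len(temp_data) < 6:  # expected number of columns
                more = next(infile).rstrip('\n').split(',')
                # the first piece of the next line continues the broken field
                temp_data[-1] += more[0]
                temp_data.extend(more[1:])
        except StopIteration:
            pass
        data.append(temp_data)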

Answer 3: (score: 1)

This works with the CSV module and cleans up blank fields and rows:

import csv
import StringIO

data="""A, B, C, D, E, F
123, 456, tree

,,
, very, bla, indigo"""

f=StringIO.StringIO(data)   #used just to simulate a file. Use your file here...
reader = csv.reader(f)
out=[]
for line in reader:
    line=[x.strip() for x in line if x]   # remove 'if x' if you want blank fields
    if len(line):
        out.append(line)

print out        

Prints:

[['A', 'B', 'C', 'D', 'E', 'F'], 
 ['123', '456', 'tree'], 
 ['very', 'bla', 'indigo']]

If you want chunks of 6 columns:

cols=6        
out=[i for sl in out for i in sl]                      # flatten out
out=[out[i:i+cols] for i in range(0, len(out), cols)]  # rechunk into 'cols' 

Prints:

[['A', 'B', 'C', 'D', 'E', 'F'],
 ['123', '456', 'tree', 'very', 'bla', 'indigo']]

Answer 4: (score: 1)

If the number of fields in each row is the same and fields cannot be empty:

from itertools import izip_longest

nfields = 6
with open(filename) as f:
     fields = (field.strip() for line in f for field in line.split(',') if field)
     for row in izip_longest(*[iter(fields)]*nfields): # grouper recipe*
         print(row)

* the grouper recipe from the itertools documentation

Output:

('A', 'B', 'C', 'D', 'E', 'F')
('123', '456', 'tree', 'very', 'bla', 'indigo')
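
The snippet above targets Python 2, where the function is called izip_longest. On Python 3 the same function is itertools.zip_longest, so a roughly equivalent sketch might look like the following. The in-memory sample text stands in for the file and is an assumption made for this example, as is filtering on the stripped field so that whitespace-only fragments are dropped too.

```python
import io
from itertools import zip_longest

sample = "A, B, C, D, E, F\n123, 456, tree\n, very, bla, indigo\n"
nfields = 6

f = io.StringIO(sample)  # stands in for an open file
# Flatten every line into stripped, non-empty fields.
fields = (field.strip()
          for line in f
          for field in line.split(",")
          if field.strip())

# grouper recipe: the same iterator repeated nfields times yields fixed-size rows
rows = list(zip_longest(*[iter(fields)] * nfields))
for row in rows:
    print(row)
```

If the total number of fields is not a multiple of nfields, zip_longest pads the last row with None, which can serve as a hint that the input did not match the assumed column count.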

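Answer 5: (score: 0)

If you know the number of columns, the best way is to ignore the line structure and split the whole file into fields.

Something like this: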

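with open(filename, 'rU') as fp:
    # treat every line ending as just another field delimiter
    data = fp.read().replace('\n', ',')

# drop the empty fields left where a newline sat next to a comma
# (this assumes no field is legitimately empty)
data = [field.strip() for field in data.split(',') if field.strip()]
for n in range(0, len(data), 6):
    print(data[n:n+6])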

If you like, you can easily turn this into a generator:

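def read_ugly_file(filename, delimiter=',', columns=6):
    with open(filename, 'rU') as fp:
        # treat every line ending as just another field delimiter
        data = fp.read().replace('\n', delimiter)

    # drop the empty fields left where a newline sat next to a delimiter
    # (this assumes no field is legitimately empty)
    data = [field.strip() for field in data.split(delimiter) if field.strip()]
    for n in range(0, len(data), columns):
        yield data[n:n+columns]

for row in read_ugly_file('myfile.csv'):
    print(row)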
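
Since this approach trusts the column count completely, one cheap safeguard is to check that the number of fields is a multiple of it before regrouping. The helper below is only a sketch of that idea in Python 3; the name regroup and the sample list are hypothetical and do not come from the answer itself.

```python
def regroup(fields, columns):
    """Split a flat list of fields into rows of `columns` items each."""
    if len(fields) % columns != 0:
        # a leftover means a field was lost, added, or legitimately empty
        raise ValueError(
            "%d fields cannot be split evenly into rows of %d"
            % (len(fields), columns)
        )
    return [fields[i:i + columns] for i in range(0, len(fields), columns)]

flat = ['A', 'B', 'C', 'D', 'E', 'F',
        '123', '456', 'tree', 'very', 'bla', 'indigo']
rows = regroup(flat, 6)
print(rows)
```

The check cannot prove the rows are right, but it does catch inputs where the assumption about the column count is clearly violated.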