Question

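I have about 100 CSV files. They come from different sources, so they use different delimiters. Is there a Python library that can guess the structure of a CSV?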

For example, someone has a table like this:

color, shape, avg weight, 
red, square, 15g, 
blue, circle, 11g,

and the CSV they save looks like this:

'color', 'shape', 'avg weight', 'red', 'square', '15g', 'blue', 'circle', '11g'

If I know the number of columns (I compute it with a function), I can build a list of lists and turn that into a pandas DataFrame.

However, many people's data has no comma at the end of each row, like this:

color, shape, avg weight 
red, square, 15g 
blue, circle, 11g

The CSV they send then looks like this:

'color', 'shape', 'avg weight' 'red', 'square', '15g' 'blue', 'circle', '11g'

It gets worse if a value is missing from avg weight:

color, shape, avg weight 
red, square,
blue, circle, 11g

which results in a CSV that looks like this:

'color', 'shape', 'avg weight' '', 'square', '15g' 'blue', 'circle', '11g'

How should I handle this? Or is there a library I could look into?

Answer 1

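This approach may work if you can at least rely on the quotes. The idea is to match the quoted expressions with a regular expression, and then use what we know about the number of columns to build the DataFrame. If you do not know the number of columns in advance and you cannot rely on the quotes, I don't think there is a reasonable way to reconstruct the data without line breaks.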

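import re
import pandas

s = "'color', 'shape', 'avg weight' '', 'square', '15g' 'blue', 'circle', '11g'"

Ncols = 3  # number of columns, known in advance

# Capture the text between each pair of single quotes (empty fields included)
r = re.compile(r"'([^']*)'")
items = r.findall(s)

# Group the flat list of fields into rows of Ncols items each
table = [items[i*Ncols:i*Ncols+Ncols] for i in range(len(items)//Ncols)]

# The first row is the header; the remaining rows are the data
df = pandas.DataFrame(table[1:], columns=table[0])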
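
Note that the list comprehension above silently drops any leftover fields when the number of items is not a multiple of Ncols. As a rough sketch of a stricter variant, the same regex idea can be wrapped in a function that refuses to guess in that case. The name split_rows and its error message are my own illustration, not part of any library, and it still assumes every field is wrapped in single quotes and contains no quote itself.

```python
import re

# One single-quoted field, possibly empty (assumes no quotes inside a field)
FIELD = re.compile(r"'([^']*)'")

def split_rows(text, ncols):
    """Return (header, rows) rebuilt from a flattened string of quoted fields."""
    items = FIELD.findall(text)
    # Refuse to guess if the fields do not fill a whole number of rows
    if not items or len(items) % ncols:
        raise ValueError(f"{len(items)} fields do not fit rows of {ncols} columns")
    rows = [items[i:i + ncols] for i in range(0, len(items), ncols)]
    return rows[0], rows[1:]

s = "'color', 'shape', 'avg weight' '', 'square', '15g' 'blue', 'circle', '11g'"
header, rows = split_rows(s, 3)
```

From there, pandas.DataFrame(rows, columns=header) should give the same frame as before, while a wrong column count raises an error instead of quietly losing data.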

Python: how to handle a CSV with no comma at the end of each row?
