Question

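In short, I have a 20,000,000-line CSV file whose rows have different lengths. This is due to archaic data loggers and proprietary formats. We receive the end result as a CSV file in the format shown below. My goal is to insert this file into a Postgres database. How can I do the following?

Keep the first 8 columns and the last 2 columns, so that the CSV file is consistent.
Add a new column to the CSV file, at either the first or the last position.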

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50

Answer 1

Read a row with csv, then:

newrow = row[:8] + row[-2:]

Then add the new field and write the row out (also with csv).
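The steps above can be sketched as follows. This is only an illustration: the helper name `normalize_rows` and the constant value `"onemore"` for the new column are assumptions, and the input is held in memory rather than read from the real file.

```python
import csv
import io

def normalize_rows(infile, outfile, extra="onemore"):
    """Keep the first 8 and last 2 fields of each row, then append `extra`."""
    # skipinitialspace=True drops the blank that follows some commas in the sample.
    reader = csv.reader(infile, skipinitialspace=True)
    writer = csv.writer(outfile, lineterminator="\n")
    for row in reader:
        writer.writerow(row[:8] + row[-2:] + [extra])

# Two sample rows of different lengths, held in memory instead of a real file.
sample = io.StringIO(
    "1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50\n"
)
result = io.StringIO()
normalize_rows(sample, result)
print(result.getvalue())
```

Note that a row with fewer than 10 fields would make the two slices overlap, so a length check before slicing is probably worth adding for real data.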

Answer 2

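You can open the file as a text file and read it one line at a time. Are there any quoted or escaped commas that do not "split fields"? If not, you can do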

with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        # Split the current line (not the file object) on commas.
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        # ... use the `consist` list as warranted ...

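I suspect that instead of + ['onemore'] you will want to "add a column", as you put it, with some very different content, but of course I can't guess what that might be.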

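Don't send each row to the database with a separate insert: 20 million inserts would take a long time. Instead, group the "made-consistent" lists by appending them to a temporary list, and each time that list's length reaches 1000, use executemany to add all of those entries at once.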
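A minimal sketch of that batching idea, assuming any DB-API cursor. The helper name `insert_in_batches` and the `readings` table are made up for illustration, and an in-memory SQLite database stands in for Postgres so the example runs on its own. SQLite uses `?` placeholders; Postgres drivers such as psycopg2 expect `%s` instead.

```python
import sqlite3

def insert_in_batches(cursor, sql, rows, batch_size=1000):
    """Send rows to the database in groups of `batch_size` via executemany."""
    batch = []
    total = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partial batch
        cursor.executemany(sql, batch)
        total += len(batch)
    return total

# Stand-in database: in-memory SQLite with a hypothetical two-column table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE readings (img TEXT, value INTEGER)")
rows = (("img_%d.jpg" % i, -50) for i in range(2500))
inserted = insert_in_batches(cur, "INSERT INTO readings VALUES (?, ?)", rows)
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(inserted, count)
```

Because `rows` can be a generator, the whole file never has to sit in memory; only one batch is held at a time.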

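Edit: to clarify, I don't recommend using csv to process a file that you know is not in "proper" csv format. Processing it directly gives you more direct control, especially as you discover irregularities other than the varying number of commas per row.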

Answer 3

I would recommend using the csv module. Here is some code based on CSV processing I have done elsewhere:

import csv
import sys

def process(reader, writer):
    # Keep the first 8 and the last 2 fields of every row.
    for row in reader:
        data = row[:8] + row[-2:]
        writer.writerow(data)

def main(infilename, outfilename):
    # newline='' lets the csv module handle line endings itself.
    with open(infilename, newline='') as infile:
        reader = csv.reader(infile)
        with open(outfilename, 'w', newline='') as outfile:
            writer = csv.writer(outfile)
            process(reader, writer)

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("syntax: python process.py filename outname")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])

Answer 4

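Sorry, but you will need to write some code for this one. With a file this large, it is worth checking all of it to make sure it matches what you expect. If you let bad data into your database, you will never get all of it out again.

Remember the oddities of CSV: it is a mishmash of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.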
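As a small illustration of csv.Sniffer, the sketch below guesses the delimiter of a made-up, well-behaved sample. The helper name `guess_dialect` and the sample text are assumptions; on a ragged file like the one in the question, the sniffer may guess wrong or raise csv.Error, so treat its answer as a hint only.

```python
import csv

def guess_dialect(sample_text, candidates=",;\t|"):
    """Ask csv.Sniffer to guess the dialect of a small text sample."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates)

# Hypothetical semicolon-separated sample with a consistent field count.
sample = "1;2;img_a.jpg;-50\n3;4;img_b.jpg;-51\n5;6;img_c.jpg;-52\n"
dialect = guess_dialect(sample)

# The guessed dialect can be passed straight to csv.reader.
rows = list(csv.reader(sample.splitlines(), dialect))
print(repr(dialect.delimiter), rows[0])
```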

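I would suggest that you:

Run the 'tail' command to look at the last few lines.
If the file appears well behaved, run the whole file through the csv reader to see whether it breaks. Make a quick histogram of "fields per line".
Think about the "valid" ranges and character types, and check them rigorously as you read. Watch especially for unusual unicode or characters outside the printable range.
Seriously consider whether you want to keep the extra, odd-ball values in a "rest of the line" text field.
Toss any unexpected lines into an exception file.
Fix your code to handle the new patterns found in the exception file. Rinse. Repeat.
Finally, run the whole process again, this time actually dumping the data into the database.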
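The "fields per line" histogram from the list above could be sketched like this. The helper name `field_count_histogram` is made up, and the sample rows are held in memory; for the real file you would pass the open file object instead.

```python
import csv
import io
from collections import Counter

def field_count_histogram(lines):
    """Map 'number of fields in a row' -> 'how many rows have that count'."""
    return Counter(len(row) for row in csv.reader(lines))

# Three sample rows in the question's format: one long, two short.
sample = io.StringIO(
    "1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50\n"
)
histogram = field_count_histogram(sample)
for n_fields, n_rows in sorted(histogram.items()):
    print(n_fields, n_rows)
```

A handful of rare field counts in the output usually points at the rows worth inspecting by hand first.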

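Your development will go faster if you don't touch the database until everything else is done. Also, be advised that SQLite is very fast on read-only data, so Postgres might not be the best solution.

Your final code will probably look something like this, but I can't say for sure without knowing your data, especially how well behaved it is: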

# Draft: assumes `reader` (a csv.reader), `cursor`, `database` and
# `printable_chars` (e.g. string.printable) are set up elsewhere.
eof = False
while not eof:
    out = []
    for chunk in range(1000):
        try:
            fields = next(reader)
        except StopIteration:
            eof = True
            break
        except Exception:
            print(str(reader.line_num) + ", 'failed to parse'")
            continue  # no usable fields for this line
        try:
            assert len(fields) > 5 and len(fields) < 12
            assert int(fields[3]) > 0 and int(fields[3]) < 999999
            assert int(fields[4]) >= 1 and int(fields[4]) <= 12  # date
            assert fields[5] == fields[5].strip()  # no extra whitespace
            assert not fields[5].strip(printable_chars)  # no odd chars
            ...
        except (AssertionError, ValueError):  # int() raises ValueError
            print(str(reader.line_num) + ", 'failed checks'")
        new_rec = [reader.line_num]  # new first item
        new_rec.extend(fields[:8])   # first eight
        new_rec.extend(fields[-2:])  # last two
        new_rec.append(",".join(fields[8:-2]))  # and the rest
        out.append(new_rec)
    if database:
        cursor.executemany("INSERT INTO raw_table VALUES %d,...", out)

Of course, your mileage may vary. This is a first draft of pseudo-code. Expect that writing solid code for this input will take about a day.
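For comparison, here is a small runnable variant of the draft that separates accepted records from rejects instead of only printing a message. It is a sketch under assumptions: the helper name `split_good_and_bad` and both validation rules (an integer last field, no space inside the image name) are invented for illustration and would need to be replaced by checks that match the real data.

```python
import csv
import io

def split_good_and_bad(lines, min_fields=10):
    """Return (good, bad): cleaned records and (line_num, raw_row) rejects."""
    good, bad = [], []
    reader = csv.reader(lines, skipinitialspace=True)
    for fields in reader:
        try:
            if len(fields) < min_fields:
                raise ValueError("too few fields")
            if " " in fields[-2]:  # hypothetical rule: catches a missing comma
                raise ValueError("unexpected space in image name")
            int(fields[-1])  # hypothetical rule: last field is an integer
        except ValueError:
            bad.append((reader.line_num, fields))
            continue
        record = [reader.line_num] + fields[:8] + fields[-2:]
        record.append(",".join(fields[8:-2]))  # the "rest of the line"
        good.append(record)
    return good, bad

sample = io.StringIO(
    "1, 2, 3, 4, 5, 0,0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50\n"
    "1, 2, 3, img_id.jpg, -50\n"
)
good, bad = split_good_and_bad(sample)
print(len(good), "accepted;", [line for line, _ in bad], "rejected")
```

In a real run the `bad` list would be written to the exception file, and `good` would be handed to the database in batches.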

Python - CSV: large file with rows of different lengths
