Question

I have a huge (240 MB) CSV file in which the first 2 lines are junk data. I want to remove this junk and use the data that starts after it.

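I would like to know what the best option is. Since the file is large, creating a copy of it and editing that would be time-consuming. Here is an example of the CSV: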

    junk,,,
    ,,,,
    No,name,place,destination
    1,abx,India,SA

What I want to end up with is:

    No,name,place,destination
    1,abx,India,SA

Answer 1

You can do this easily with tail:

tail -n+3 foo > result.data

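You mentioned the first 3 lines, but your example removes only the first 2. Note that tail -n +K starts printing at line K, so the command above drops exactly two lines. To drop only the first line, it would be: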

tail -n+2 foo > result.data

You can find more approaches here:

https://unix.stackexchange.com/questions/37790/how-do-i-delete-the-first-n-lines-of-an-ascii-file-using-shell-commands
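If tail is not available, the same streaming idea can be sketched in Python. This is only an illustration: drop_leading_lines and its parameters are hypothetical names, and the file is handled in binary mode on the assumption that line endings should be preserved byte for byte.

```python
import shutil


def drop_leading_lines(src_path, dst_path, skip=2):
    """Copy src_path to dst_path, leaving out the first `skip` lines."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Read and discard the leading lines one at a time.
        for _ in range(skip):
            src.readline()
        # Stream the remainder in chunks, so memory use stays small.
        shutil.copyfileobj(src, dst)
```

Calling drop_leading_lines("foo", "result.data") would then play the role of the tail -n+3 command above.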

Answer 2

Just throw those lines away.

Then use DictReader to parse the header:

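import csv

with open("filename") as fp:
  # Skip the two junk lines.
  fp.readline()
  fp.readline()

  # The next line is the header; DictReader uses it for the keys.
  csvreader = csv.DictReader(fp, delimiter=',')
  for row in csvreader:
    print(row)  # your code here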
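To see what this produces, here is a small self-contained sketch that runs the same steps on the sample from the question. io.StringIO stands in for the real file, and read_rows is a hypothetical helper name, not part of the csv module.

```python
import csv
import io

# Sample data copied from the question; StringIO stands in for a real file.
sample = io.StringIO(
    "junk,,,\n"
    ",,,,\n"
    "No,name,place,destination\n"
    "1,abx,India,SA\n"
)


def read_rows(fp, skip=2):
    """Return the data rows as dicts, ignoring the first `skip` lines."""
    for _ in range(skip):
        fp.readline()
    # DictReader takes the next line it sees as the header.
    return list(csv.DictReader(fp))


rows = read_rows(sample)
print(rows[0]["name"], rows[0]["destination"])  # prints: abx SA
```

Nothing is written back to disk here, so this fits the case where the data only needs to be read, not the case where the file itself must be cleaned.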

Answer 3

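Because of the way file systems work, you cannot simply delete these lines from the file in place. Any method of doing so necessarily involves rewriting the entire file with the offending lines left out.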

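To be safe, you will want to store the new file temporarily before deleting the old one, until you are sure the new file was created successfully. And if you want to avoid reading the whole large file into memory, you will want to use a generator.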

Here is a generator that yields every item of an iterable (such as a file-like object) once a given number of items have been skipped:

def gen_after_x(iterable, x):
    # Works in both Python 2 and Python 3. In Python 3 the loop could
    # instead be written as:
    #     yield from (item for index, item in enumerate(iterable) if index >= x)
    for index, item in enumerate(iterable):
        if index >= x:
            yield item
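For comparison, itertools.islice from the standard library performs the same kind of skipping lazily. A minimal sketch; skip_first is a hypothetical name introduced only for this illustration.

```python
from itertools import islice


def skip_first(iterable, x):
    """Return an iterator over `iterable` that starts at index `x`."""
    # islice(iterable, x, None) consumes and discards the first x items,
    # then yields the rest one at a time.
    return islice(iterable, x, None)


print(list(skip_first(["junk", "", "header", "row 1"], 2)))  # ['header', 'row 1']
```

Because islice works on any iterable, an open file object can be passed in directly and is still read line by line.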

To make things simpler, we will create a function to write the temporary file:

def write_file(fname, lines):
    with open(fname, 'w') as f:
        for line in lines:
            # Lines read from a file keep their trailing newline,
            # so they are written unchanged.
            f.write(line)

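We will also need the remove and rename functions from the os module to delete the source file and rename the temporary file. And we will need copyfile from shutil to make a copy, so that we can safely delete the source file.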

Now to put it all together:

from os import remove, rename
from shutil import copyfile

src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as fin:
    olines = gen_after_x(fin, skip)
    write_file(tmp_file, olines)

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)

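try:
    # Swap the temporary file in for the original.
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        # Something went wrong: restore the original from its copy
        # and clean up the leftovers.
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise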
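As a possible variation on the same idea, Python 3.3+ offers os.replace, which overwrites the destination in a single step and so makes the extra safety copy unnecessary. The sketch below is not the method used above; strip_leading_lines is a hypothetical name, and it assumes the temporary file may live in the same directory as the source.

```python
import os
import shutil
import tempfile


def strip_leading_lines(path, skip=2):
    """Rewrite `path` without its first `skip` lines, via a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the source so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            for _ in range(skip):
                src.readline()
            shutil.copyfileobj(src, dst)
        # os.replace overwrites the destination in one step (Python 3.3+).
        os.replace(tmp_path, path)
    except BaseException:
        # The source has not been replaced; discard the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

One caveat: mkstemp creates the file with restrictive permissions, so the rewritten file may not keep the original file's mode or ownership.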

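However, I would note that 240 MB is not a huge file these days; you may find it faster to do this the usual way, since it cuts down on repeated disk writes: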

src_file = 'big_file'
tmp_file = 'big_file_temp'
skip = 2

with open(src_file) as f:
    lines = f.readlines()

# Drop the first `skip` lines in one step.
del lines[:skip]

with open(tmp_file, 'w') as f:
    # Each line already ends with its own newline, so write them unchanged.
    f.writelines(lines)

src_file_copy = src_file + '_copy'
copyfile(src_file, src_file_copy)

try:
    remove(src_file)
    rename(tmp_file, src_file)
    remove(src_file_copy)
except Exception:
    try:
        copyfile(src_file_copy, src_file)
        remove(src_file_copy)
        remove(tmp_file)
    except Exception:
        pass
    raise

...or, if you prefer the riskier way:

with open(src_file) as f:
    lines = f.readlines()

# Drop the first `skip` lines in one step.
del lines[:skip]

# Overwrites the source directly; a failure here can lose data.
with open(src_file, 'w') as f:
    f.writelines(lines)
