Fast, accurate, reliable way to remove unwanted values from a csv file

Time: 2016-04-25 21:51:57

Tags: python csv data-processing

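I have a large csv file that contains a lot of dirty data, and I would like to clean it up by removing every value that is not absolutely necessary.

Here is the file I am talking about.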

It has these columns:

Website Title Start Date Employer Location lat lon Country Skills11 Jobs

But I want to remove all of them except:

Employer Location Country Jobs

Is there a specific tool suited to this task?

Or perhaps someone has a handy Python script that would do the job?

2 answers:

Answer 0: (score: 4)

You can easily use Python to write the wanted columns to a temporary file and then replace the original file with it.

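import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move

# Text mode and newline="" are what the csv module expects on Python 3.
with open("edsa_data.csv", newline="") as f, \
        NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
    # Indices 3, 4, 7, 9 are Employer, Location, Country, Jobs.
    # map is lazy on Python 3 (itertools.imap on Python 2).
    csv.writer(tmp).writerows(map(itemgetter(3, 4, 7, 9), csv.reader(f)))
# Both files are closed here, so the temp file can replace the original.
move(tmp.name, "edsa_data.csv")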
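Hard-coded indices are easy to get wrong by one column. As a rough sketch, the indices could instead be derived from the header row. The helper name `column_indices` is hypothetical, and the sample header assumes the column order listed in the question, with "Start Date" as a single column:

```python
import csv
import io


def column_indices(header, wanted):
    """Return the position of each wanted column name in the header row."""
    return [header.index(name) for name in wanted]


# Header row as listed in the question (assumed order and names).
sample = io.StringIO(
    "Website,Title,Start Date,Employer,Location,lat,lon,Country,Skills11,Jobs\n"
)
header = next(csv.reader(sample))
indices = column_indices(header, ["Employer", "Location", "Country", "Jobs"])
print(indices)  # [3, 4, 7, 9]
```

The resulting list can be unpacked into itemgetter, for example itemgetter(*indices).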

A more general approach:

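import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move


def keep_columns(csv_f, keep_cols, **kwargs):
    """Rewrite csv_f in place, keeping only the columns at indices keep_cols.

    Pass at least two indices: itemgetter with a single index returns a
    bare value instead of a tuple, which writerows would split into characters.
    """
    with open(csv_f, newline="") as f, \
            NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
        pick = itemgetter(*keep_cols)
        csv.writer(tmp, **kwargs).writerows(pick(row) for row in csv.reader(f, **kwargs))
    move(tmp.name, csv_f)


keep_columns("edsa_data.csv", (3, 4, 7, 9))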

For kwargs, you can pass the csv module's formatting parameters, such as delimiter=",", skipinitialspace=True, and so on.
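To illustrate what those keyword arguments change, here is a small sketch using csv.reader directly on an in-memory string. The sample row is made up; the csv module's keyword for the separator is delimiter:

```python
import csv
import io

# Made-up sample row: semicolon-separated, with a space after each separator.
raw = "a; b; c\n"

# Default dialect: the comma never appears, so the whole line is one field.
default_rows = list(csv.reader(io.StringIO(raw)))

# With an explicit delimiter, and spaces after it skipped.
tuned_rows = list(
    csv.reader(io.StringIO(raw), delimiter=";", skipinitialspace=True)
)

print(default_rows)  # [['a; b; c']]
print(tuned_rows)    # [['a', 'b', 'c']]
```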

Answer 1: (score: 2)

For ease of maintenance, I use a DictReader / DictWriter pair.

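import csv
import sys

# Usage: python script.py input.csv output.csv
# newline='' is what the csv module expects for files on Python 3.
with open(sys.argv[1], 'r', newline='') as csv_infile:
    with open(sys.argv[2], 'w', newline='') as csv_outfile:
        csv_in = csv.DictReader(csv_infile)
        # extrasaction='ignore' silently drops every column not listed here.
        csv_out = csv.DictWriter(
            csv_outfile,
            ['Employer', 'Location', 'Country', 'Jobs'],
            extrasaction='ignore')
        csv_out.writeheader()
        csv_out.writerows(csv_in)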
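As a quick check of how extrasaction='ignore' behaves, the following sketch runs the same reader/writer pairing on in-memory data. The sample rows are made up, and lineterminator="\n" is set only to keep the output easy to compare:

```python
import csv
import io

# Made-up input with one column we do not want (lat).
source = io.StringIO("Employer,lat,Jobs\nAcme,51.5,3\n")
target = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(
    target,
    ["Employer", "Jobs"],
    extrasaction="ignore",  # drop keys that are not in the field list
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(reader)

result = target.getvalue()
print(result)
```

Without extrasaction='ignore', DictWriter uses its default of 'raise' and fails with a ValueError on the first row that contains an unlisted column.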