How do I filter a CSV file?

Time: 2017-11-06 09:22:17

Tags: python django csv

I have a CSV file containing random data, and I want to filter it. I want to keep only the rows whose content starts with $ and ends with #:

2017-09-07 03:11:03,5,hello
2017-09-07 03:11:16,6,yellow
2017-09-07 03:11:22,28,some other stuff with spaces
2017-09-08 20:24:36,157,
2017-10-28 04:39:25,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
2017-10-28 04:39:48,108,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
2017-10-28 04:40:26,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
2017-10-28 04:40:29,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#

1 answer:

Answer 0 (score: 2)

I think this is a good use case for a filtering generator function:

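import csv
import re

CSV_FILENAME = 'data.csv'  # placeholder: path to your input file


def filter_lines(f):
    """This generator function uses a regular expression
    to include only lines that have a `$` and end with a `#`.
    """
    filter_regex = r'.*\$.*\#$'
    for line in f:
        # Strip the trailing newline so that `$` in the regex anchors on the `#`.
        line = line.strip()
        m = re.match(filter_regex, line)
        if m:
            yield line


with open(CSV_FILENAME) as f:
    filter_generator = filter_lines(f)
    # csv.reader accepts any iterable of strings, so the generator can be passed directly.
    csv_reader = csv.reader(filter_generator)
    for row in csv_reader:
        pass  # process each filtered row here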
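
To see what this first pattern lets through, here is a small self-contained check. It is only a sketch: the `SAMPLE` text (three lines copied from the question) and the name `keep_matching_lines` are made up for illustration, and `io.StringIO` stands in for the real file.

```python
import csv
import io
import re

# Three lines from the question's sample, inlined so no file is needed.
SAMPLE = (
    "2017-09-07 03:11:03,5,hello\n"
    "2017-09-08 20:24:36,157,\n"
    "2017-10-28 04:39:25,54,$SITE0011,1654,0000,0000,0000,00000000,"
    "000000^A^A^A^A^A^A^@^@#\n"
)

# Same pattern as above: a `$` somewhere in the line, and `#` as the last character.
FILTER_REGEX = re.compile(r'.*\$.*#$')


def keep_matching_lines(lines):
    """Yield stripped lines that contain a `$` and end with a `#`."""
    for line in lines:
        line = line.strip()
        if FILTER_REGEX.match(line):
            yield line


rows = list(csv.reader(keep_matching_lines(io.StringIO(SAMPLE))))
for row in rows:
    print(row)
```

Note that the whole line is yielded, so each resulting row still begins with the timestamp and the numeric column that precede the `$`.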

Edit:

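I now realize that in your example a single "line" can contain multiple matches (as on line 6 of the sample). This slightly modified version handles that case as well: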

import csv
import re

CSV_FILENAME = 'data.csv'  # placeholder: path to your input file


def filter_lines(f):
    """This generator function uses a regular expression
    to yield every segment that starts with a `$` and ends with a `#`.
    A single input line may produce several segments.
    """
    filter_regex = r'(\$[^#]*\#)'
    for line in f:
        line = line.strip()
        # findall returns every non-overlapping `$...#` segment on the line.
        matches = re.findall(filter_regex, line)
        for m in matches:
            yield m


with open(CSV_FILENAME) as f:
    filter_generator = filter_lines(f)
    csv_reader = csv.reader(filter_generator)
    for row in csv_reader:
        print(row)

Output generated from the sample input:

['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
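
The difference between the two versions shows up on the sixth sample line, which carries two records back to back. The following is a minimal sketch comparing the two patterns on that line; the constants `RECORD` and `LINE` are hypothetical names introduced here only to inline the sample data.

```python
import re

# Sixth line of the sample input: two `$...#` records back to back.
RECORD = "$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#"
LINE = "2017-10-28 04:39:48,108," + RECORD + RECORD

# First version: a single match that covers the whole line, timestamp included.
whole_line = re.match(r'.*\$.*#$', LINE)

# Second version: each `$...#` segment is returned separately,
# and the leading timestamp and numeric column are dropped.
segments = re.findall(r'(\$[^#]*#)', LINE)

print(whole_line.group(0) == LINE)
print(len(segments))
print(all(segment == RECORD for segment in segments))
```

This is why the output above has five rows for four matching input lines, and why none of them includes the timestamp.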