Identifying CSV in Python

Time: 2016-09-22 02:44:59

Tags: python csv

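I have a data dump that is a "messy" CSV (about 100 files, each containing roughly 1000 lines of actual CSV data).
Besides the CSV, the dump also contains some other text. How can I programmatically extract just the CSV part?

As an example, a data file looks like this: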

Session:1
Data collection date: 09-09-2016
Related questions:
    Question 1: parta, partb, partc,
    Question 2: parta, partb, partc

"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"

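I need to extract the CSV part.

Caveats: the text at the beginning is not limited to 4-5 lines, and the additional text does not appear only at the beginning of the file.

I saw this post suggesting re.split and/or csv.Sniffer, but my attempts were not fruitful.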

import csv
import re

with open("untitled.csv") as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    print(dialect.__dict__)
    csvstarts = False
    csvdump = []
    for ln in csvfile.readlines():
        toks = re.split(r'[,]', ln)
        print(toks)
        if toks[0] == '"field1"' and not csvstarts: # identify by the header line
            csvstarts = True
            continue
        if csvstarts:
            if toks[0] == '"field1"': # identify the start of subsequent csv data
                csvstarts = False
                continue
            csvdump.append(ln)  # record the current line

    print(csvdump)

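As it stands, I can only identify the CSV lines accurately if there is a single block of data.

Can I do better?

3 answers:

Answer 0: (score: 1)

How about this: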

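import re

# A CSV line starts with one or more double-quoted fields, each followed by a comma.
my_pattern = re.compile(r'("\w+",)+')

with open('<your_file>', 'r') as fi:
    for line in fi:
        # match() only tests the start of the line.
        if my_pattern.match(line):
            print(line, end='')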

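This assumes the CSV data can be told apart from the other data by the absence of special characters: each element must consist only of word characters (letters, digits, or underscores), be enclosed in double quotes, and be separated from the next element by a comma.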
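Since the question mentions that the extra text is not only at the start of the file, the same pattern could also be used to collect each run of consecutive matching lines as its own block. The following is a minimal sketch under that assumption; `split_csv_blocks` and the sample lines are hypothetical and not part of the original answer.

```python
import re

# Same assumption as above: a CSV line starts with quoted word-character fields.
CSV_LINE = re.compile(r'("\w+",)+')

def split_csv_blocks(lines):
    """Group consecutive CSV-looking lines into separate blocks (hypothetical helper)."""
    blocks, current = [], []
    for line in lines:
        if CSV_LINE.match(line):
            current.append(line.rstrip("\n"))
        elif current:
            # A non-CSV line closes the block collected so far.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Made-up sample with two CSV blocks separated by other text.
sample = [
    "Session:1\n",
    '"field1","field2"\n',
    '"data11","data12"\n',
    "\n",
    "Some other text, with a comma\n",
    '"field1","field2"\n',
    '"data21","data22"\n',
]
blocks = split_csv_blocks(sample)
```

Each entry of `blocks` is then a list of raw CSV lines that can be parsed on its own, for example with `csv.reader`.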

Answer 1: (score: 0)

If the only lines that start with a double quote (") are your CSV lines, you can do this:

import csv

# For quotechar, use something that won't turn up in the data, so that
# the double quotes are kept as part of each field.
with open("test.csv", newline="") as f:
    data = list(csv.reader(f, quotechar="¬"))

def importCSV(data):
    # Outputs a list of lists with the required data.
    # Works on the assumption that all required data starts with "
    # and that no other text starts with ".

    out = []

    for line in data:
        # Keep only non-empty rows whose first field starts with a double quote.
        if line and line[0].startswith('"'):
            line = [el.replace('"', "") for el in line]
            out.append(line)

    return out

useful = importCSV(data)
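One thing to keep in mind with this approach: because quote handling is effectively switched off, a quoted field that itself contains a comma would be split in two. A possible variation, sketched below under the same assumption that only CSV lines start with a double quote, filters the raw lines first and then hands them to `csv.reader` with its default quoting. `extract_rows` and the sample text are hypothetical.

```python
import csv
import io

def extract_rows(text_stream):
    """Parse only the lines that start with a double quote (hypothetical helper)."""
    # Assumption: every CSV line starts with '"' and no other line does.
    csv_lines = (line for line in text_stream if line.startswith('"'))
    # csv.reader accepts any iterable of strings, so default quote handling applies.
    return list(csv.reader(csv_lines))

# Made-up sample mixing free text and CSV, including a comma inside a quoted field.
sample = io.StringIO(
    'Session:1\n'
    'Related questions:\n'
    '    Question 1: parta, partb, partc,\n'
    '\n'
    '"field1","field2"\n'
    '"data, with comma","data12"\n'
)
rows = extract_rows(sample)
```

This sketch does not handle quoted fields that span several lines, since the continuation lines would not start with a double quote.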

Answer 2: (score: 0)

Could you read each line and use a regular expression to decide whether or not to extract it? Maybe something like:

^(["][\w]*["][,])+["][\w]*["]$

My regex isn't the best and there may be a better way, but this seems to work for me.
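A per-line check along these lines might look like the sketch below. It assumes the intended pattern matches a whole line made of double-quoted word-character fields separated by commas; `is_csv_row` is a hypothetical helper name.

```python
import re

# Whole-line pattern: one or more '"field",' groups followed by a final '"field"'.
ROW = re.compile(r'^("\w*",)+"\w*"$')

def is_csv_row(line):
    """Return True if the line looks like a quoted CSV row (hypothetical helper)."""
    # Strip the line ending so the pattern is tested against the content only.
    return ROW.match(line.rstrip("\r\n")) is not None

# Lines taken from the sample file in the question.
csv_line = '"data11","data12","data13","data14"\n'
text_line = '    Question 1: parta, partb, partc,\n'
results = (is_csv_row(csv_line), is_csv_row(text_line))
```

Because the pattern requires at least one comma-terminated field, a line with a single quoted field would not be recognised as CSV.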