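I have a data dump that is a "messy" CSV (roughly 100 files, each containing about 1000 lines of actual CSV data).
Besides the CSV, the dump also contains some other text. How can I programmatically extract just the CSV part?
As an example, a data file looks like this: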
Session:1
Data collection date: 09-09-2016
Related questions:
Question 1: parta, partb, partc,
Question 2: parta, partb, partc
"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"
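I need to extract the CSV part.
Caveats:
The leading text is not limited to 4-5 lines.
The extra text does not appear only at the beginning of the file.
I saw this post, which suggests using re.split and/or csv.Sniffer, but my attempts have not been fruitful.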
import csv
import re

with open("untitled.csv") as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    print(dialect.__dict__)
    csvstarts = False
    csvdump = []
    for ln in csvfile.readlines():
        toks = re.split(r'[,]', ln)
        print(toks)
        if toks[0] == '"field1"' and not csvstarts:  # identify by the header line
            csvstarts = True
            continue
        if csvstarts:
            if toks[0] == '"field1"':  # identify the start of subsequent csv data
                csvstarts = False
                continue
            csvdump.append(ln)  # record the current line
    print(csvdump)
Right now I can identify the CSV lines accurately only if there is a single block of data.
Can I do better?
Answer 0 (score: 1)
How about this:
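import re

# One or more double-quoted alphanumeric fields, each followed by a comma,
# matched from the start of the line.
my_pattern = re.compile(r'("\w+",)+')

with open('<your_file>') as fi:
    for line in fi:
        if my_pattern.match(line):
            print(line, end='')  # the line already ends with a newline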
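This assumes the CSV data can be told apart from the other text by the absence of special characters: each element may contain only letters, digits, or underscores, must be enclosed in double quotes, and is separated from the next element by a comma.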
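As a rough sketch of that assumption, the same pattern can be wrapped in a small function that filters the lines and passes the survivors to csv.reader. The name extract_csv_rows and the inline sample text are hypothetical and used only for illustration.

```python
import csv
import io
import re

# Same pattern as in the answer: quoted word-character fields followed by commas.
ROW_PATTERN = re.compile(r'("\w+",)+')

def extract_csv_rows(lines):
    """Hypothetical helper: keep lines that look like CSV rows and parse them."""
    kept = [ln for ln in lines if ROW_PATTERN.match(ln)]
    return list(csv.reader(kept))

# Small stand-in for one of the dump files (assumed layout).
sample = io.StringIO(
    'Session:1\n'
    'Question 1: parta, partb, partc,\n'
    '"field1","field2","field3","field4"\n'
    '"data11","data12","data13","data14"\n'
)
rows = extract_csv_rows(sample)
print(rows)
```

Because the pattern requires at least one quoted field followed by a comma, a row with a single column would not be recognised by this sketch.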
Answer 1 (score: 0)
If the CSV lines are the only lines that start with a double quote ("), you can do this:
import csv

# For quotechar, use something that won't turn up in the data, so the double
# quotes stay in the fields and can be used to recognise CSV rows.
with open("test.csv", newline="") as f:
    data = list(csv.reader(f, quotechar="¬"))

def importCSV(data):
    # Outputs a list of lists with the required data.
    # Works on the assumption that all required data starts with "
    # and that no other text starts with ".
    out = []
    for line in data:
        if line and line[0].startswith('"'):
            out.append([el.replace('"', "") for el in line])
    return out

useful = importCSV(data)
Answer 2 (score: 0)
Could you read each line and use a regex to check whether or not it is data to extract? Maybe something like this:
^(["][\w]*["][,])+["][\w]*["]$
My regex isn't the best and there may be a better way, but this seems to work for me.
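Since the question notes that extra text can also appear between CSV sections, one possible extension of this line-by-line check is to group consecutive matching lines into separate blocks. This is only a sketch: split_csv_blocks is a hypothetical name, the pattern is one reading of the expression above, and it assumes every field is double-quoted and contains only word characters.

```python
import csv
import re

# Whole-line check: one or more quoted fields, comma-separated.
# Assumes every field is double-quoted and holds only word characters.
LINE_PATTERN = re.compile(r'("\w*",)+"\w*"')

def split_csv_blocks(lines):
    """Hypothetical helper: group consecutive CSV-looking lines into blocks."""
    blocks, current = [], []
    for ln in lines:
        stripped = ln.strip()
        if LINE_PATTERN.fullmatch(stripped):
            current.append(stripped)
        elif current:
            # Any non-CSV line (including a blank one) ends the current block.
            blocks.append(list(csv.reader(current)))
            current = []
    if current:
        blocks.append(list(csv.reader(current)))
    return blocks

# Made-up sample with two CSV sections separated by other text.
text = '''Session:1
"field1","field2"
"data11","data12"
Some note in between
"field1","field2"
"data21","data22"
'''
blocks = split_csv_blocks(text.splitlines())
print(len(blocks))
```

Each block is returned as its own list of parsed rows, so a repeated header line ends up at the top of the block it belongs to rather than mixed into the data.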