I have a series of very messy *.csv files that are being read in by pandas. An example csv is:
Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200"
"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""
"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**
I have been using this code to import the *.csv files, handle the double header, pull out the empty columns, and then strip the offending rows of bad data:
import pandas as pd

# merge the date and time columns into one 'Datetime_(ascii)' column,
# use rows 10-11 as the two-row header, and flag blanks/na as missing
DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')
DF = DF.dropna(how="all", axis=1)  # drop the empty columns from the doubled delimiters
DF = DF.dropna(thresh=2)           # drop rows with fewer than 2 real values
droplist = ['message', 'Random']   # substrings that mark the junk message rows
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]
DF.head()
Datetime_(ascii) (Temp, øC) (SpCond, mS/cm) (Sal, ppt) (IBatt, Volts)
0 03/11/14 09:00:00 15.85 1.408 0.74 6.2
1 03/11/14 10:00:00 15.99 1.960 1.05 6.3
2 03/11/14 11:00:00 14.20 40.800 26.12 6.2
3 03/11/14 12:00:01 14.20 41.700 26.77 6.2
4 03/11/14 13:00:00 14.50 41.300 26.52 6.2
This works fine and dandy until I get a file that has an erroneous 1-row line after the header: "Random message here 031114 073721 to 031114 083200"

The error I receive is:
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979

KeyError: ('Time', 'HHMMSS')
If I delete that row, the code works fine. Similarly, if I delete the header= line, the code works fine. However, I want to be able to preserve that, since I am reading in hundreds of these files.

Difficulty: I would prefer not to open and inspect each file before the call to pandas.read_csv(), since these files can be rather large, and thus I don't want to read and save them multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve first opening the file as a StringIO buffer to remove the offending line.
Answer 0 (score: 1)
Here is one approach, making use of the fact that skiprows accepts a callable. The callable receives only the row index being considered, which is a built-in limitation of that parameter.

As such, the callable skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see if its contents match.

The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up through the row index it is currently evaluating. It also assumes that the bad line always begins with the same string ("foo" in the example), but that seems to be a safe assumption given the OP.
# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""
import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if the row index matches a problem line in the file;
    # for efficiency, quit once we pass the row index in the file
    with open(fn, "r") as f:
        data = f.read()
    for i, line in enumerate(data.splitlines()):
        if i == r and line.startswith(fail_on):
            return True
        elif i > r:
            break
    return False
fname = "foo.csv"
fail_str = "foo"
known_skip = [2]
pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
uid a b c
0 0 1 2 3
1 1 11 22 33
2 2 111 222 333
This will be much faster if you know exactly on which line the random message appears, since then you can tell it not to inspect the file contents for any index past the row that could be the offender.
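For instance, with the example foo.csv above the stray lines sit at raw indices 2 and 4, so a pure index test skips them without ever reopening the file. A minimal sketch, assuming those positions are fixed across files:

import pandas as pd

# hypothetical: "skip me" and "foo" always land on raw line indices 2 and 4
bad_rows = {2, 4}

df = pd.read_csv("foo.csv", sep=",", header=0,
                 skiprows=lambda x: x in bad_rows)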
Answer 1 (score: 0)
After some tinkering yesterday, I found a solution, and what the likely underlying problem was.

I tried the skip_test() function answer above, but I was still getting errors related to the size of the table:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11
So after playing around with skiprows=, I discovered that I just wasn't getting the behavior I wanted when using engine='c'. read_csv() was still determining the size of the file from those first few rows, and some of those single-column rows were still being passed through. It may be that my set of csvs contains a few more bad single-column rows that I hadn't planned on.

Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.

For example, I know that the largest table I will encounter in my data will be 10 rows long, so my call to pandas is:
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))
Then I use these two lines to drop the NaN rows and columns from the DataFrame:
# drop the null columns created by the doubled delimiters
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)  # drop rows that don't have at least 2 cells with real values
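As a follow-up sketch (not part of the original answer), the leftover message rows from the question could then be filtered with the same droplist idea, assuming the stray text lands in the parsed 'Datetime_(ascii)' column:

# strip rows whose datetime cell actually holds one of the junk messages;
# astype(str) guards against non-string cells left over from date parsing
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].astype(str).str.contains('|'.join(droplist))]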