我有以下代码片段:
import csv
def has_header(first_lines):
sniffer = csv.Sniffer()
return sniffer.has_header(first_lines)
其中first_lines
是文件的前2048个字节。该函数在大多数情况下都能很好地工作,并返回True
这样的文件:
SPEC#: 1, SIZE: 18473, TIME: 0.000000
1998.304312 2.15686
1998.773585 3.13725
1999.242914 3.13725
1999.712298 2.7451
2000.181736 2.94118
2000.651230 2.94118
2001.120780 2.15686
2001.590384 2.35294
2002.060043 2.94118
2002.529758 3.13725
2002.999527 2.54902
2003.469352 3.13725
2003.939232 1.96078
2004.409167 1.76471
2004.879158 2.94118
2005.349203 3.72549
2005.819304 3.33333
2006.289459 2.35294
2006.759670 1.76471
2007.229936 3.13725
2007.700258 3.52941
2008.170634 3.92157
2008.641065 3.92157
2009.111552 3.52941
2009.582094 4.70588
2010.052691 3.52941
2010.523343 3.33333
2010.994050 1.37255
2011.464812 2.35294
2011.935630 2.15686
2012.406502 3.52941
2012.877430 3.13725
2013.348413 2.15686
2013.819451 1.96078
2014.290544 1.56863
2014.761693 3.13725
2015.232896 1.76471
2015.704155 1.96078
2016.175469 3.33333
2016.646838 4.90196
2017.118262 3.52941
2017.589741 2.94118
2018.061275 1.96078
2018.532865 1.76471
2019.004510 4.11765
2019.476210 3.72549
2019.947965 2.35294
2020.419775 1.56863
2020.891640 2.15686
2021.363560 0.784314
2021.835536 1.37255
2022.307567 2.94118
2022.779653 2.15686
2023.251794 4.11765
2023.723990 4.5098
2024.196241 2.7451
2024.668548 2.54902
2025.140909 1.56863
2025.613326 2.94118
2026.085798 2.35294
2026.558325 2.94118
2027.030907 3.33333
2027.503545 3.52941
2027.976237 3.72549
2028.448985 5.09804
2028.921788 4.11765
2029.394645 3.92157
2029.867559 3.13725
2030.340527 2.15686
2030.813550 2.35294
2031.286629 5.09804
2031.759762 3.33333
2032.232951 3.52941
2032.706195 3.13725
2033.179494 4.70588
2033.652849 4.31373
2034.126258 3.92157
2034.599723 3.33333
2035.073242 3.72549
2035.546817 2.54902
2036.020447 2.35294
2036.494132 2.15686
2036.967873 2.94118
2037.441668 2.7451
2037.915519 2.15686
2038.389425 2.54902
2038.863385 2.54902
2039.337401 2.35294
2039.811473 3.52941
2040.285599 3.52941
2040.759781 4.11765
2041.234017 3.52941
2041.708309 3.72549
2042.182656 3.33333
2042.657058 2.54902
2043.131515 4.11765
2043.606028 3.52941
2044.080595 4.11765
2044.555218 2.35294
2045.029896 1.96078
但是,对于具有相同整体结构的以下文件,该函数返回False
:
SPEC#: 1, SIZE: 18474, TIME: 0.000000
1998.228113 36.8627
1998.697368 30.1961
1999.166679 35.8824
1999.636044 41.3726
2000.105465 38.6275
2000.574941 39.0196
2001.044473 41.9608
2001.514059 37.8431
2001.983701 35.098
2002.453397 37.2549
2002.923149 36.4706
2003.392956 40.5882
2003.862818 39.0196
2004.332735 36.6667
2004.802708 33.3333
2005.272735 37.8431
2005.742818 35.098
2006.212955 33.3333
2006.683148 39.0196
2007.153397 41.3726
2007.623700 41.7647
2008.094058 39.4118
2008.564472 40.7843
2009.034940 44.1176
2009.505464 42.1569
2009.976043 40
2010.446677 38.2353
2010.917366 39.2157
2011.388111 39.2157
2011.858910 36.2745
2012.329765 38.0392
2012.800675 42.3529
2013.271640 44.7059
2013.742660 38.0392
2014.213735 40
2014.684866 39.8039
2015.156051 44.902
2015.627292 41.7647
2016.098588 44.3137
2016.569939 43.9216
2017.041345 50.1961
2017.512806 51.5686
2017.984323 46.4706
2018.455894 44.7059
2018.927521 41.9608
2019.399203 46.6667
2019.870940 41.3726
2020.342732 45.8824
2020.814579 45.2941
2021.286482 45.2941
2021.758439 49.2157
2022.230452 42.1569
2022.702520 45.2941
2023.174643 45.4902
2023.646821 40.5882
2024.119054 47.0588
2024.591343 42.1569
2025.063686 43.7255
2025.536085 46.0784
2026.008539 45.098
2026.481048 44.902
2026.953612 50
2027.426231 48.4314
2027.898906 45.098
2028.371636 49.2157
2028.844420 47.8431
2029.317260 51.7647
2029.790155 49.4118
2030.263105 45.8824
2030.736111 51.1765
2031.209171 47.6471
2031.682287 52.549
2032.155458 50.7843
2032.628684 45.6863
2033.101965 48.6275
2033.575301 49.4118
2034.048692 48.2353
2034.522139 49.0196
2034.995641 51.9608
2035.469197 51.3726
2035.942809 50.1961
2036.416476 54.3137
2036.890199 50
2037.363976 48.8235
2037.837809 49.4118
2038.311696 51.9608
2038.785639 55.098
2039.259637 56.2745
2039.733690 50.7843
2040.207798 55.2941
2040.681962 58.2353
2041.156180 56.8627
2041.630454 60
2042.104783 61.1765
2042.579167 64.3137
2043.053606 60.5882
2043.528100 64.3137
2044.002650 62.1569
2044.477254 60.1961
2044.951914 68.2353
2045.426629 62.1569
2045.901399 62.7451
这是has_header
中的错误还是has_header
的启发式失败的某些边缘情况?
我注意到的一些差异:
csv.py中第394行的变量header
为['SPEC#: 1, SIZE: 18473, TIME: 0.000000']
(长度为1的列表),表示正确确定标题的第一个文件['SPEC#: 1, SIZE: 184', '4, TIME: 0.000000']
(第二个文件的长度列表)。
在为每列创建数据类型字典后,第一个文件的变量columnTypes
为{0: None}
,第二个文件为{}
。
答案 0 :(得分:13)
当我尝试这个时(data1
是你的第一个,data2
是你的第二个样本:
for data in (data1, data2):
s=csv.Sniffer()
d = s.sniff(data)
print(repr(d.delimiter))
我明白了:
' '
'7'
这意味着问题实际上是分隔符的检测。正如您所看到的in docstring of _guess_delimiter,嗅探器使用频率分析来确定分隔符的最佳候选者,在这种情况下恰好是'7'
而不是空格。
如果您将第二个样本视为带有'7'
分隔符的csv,您将得到:
[['SPEC#: 1, SIZE: 184', '3, TIME: 0.000000'],
['1998.304312 2.15686'],
['1998.', '', '3585 3.13', '25'],
['1999.242914 3.13', '25'],
['1999.', '12298 2.', '451'],
['2000.181', '36 2.94118'],
['2000.651230 2.94118'], ...]
其中第一行是完全可接受的数据行而不是标题。
不幸的是,has_header
不允许您手动指定使用过的方言,但您可以修补嗅探器的sniff
方法,以使用正确的分隔符' '
返回方言。