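Python's csv module has a very handy csv.Sniffer().has_header() method.
I can't work out how many rows it needs to judge accurately whether a file has a header.
Does it generally work on CSVs with only two or three rows, or does it need more, say fifty, to be accurate?
For context, here is my function. You can see I have a check that says "if the file has fewer than X rows, don't allow sniffing for a header". I currently have X set to 3, but I'm not sure whether it needs to be higher or could even be set to 2.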
import csv

# input_file_has_header can be True, False, or 'Auto' if unsure.
# input_file_has_header must be specified when file has less than 3 rows
# because CSV's with two rows sometimes have a header and sometimes don't
# and I don't understand the magic underlying the csv.Sniffer().has_header() method
def csv_to_object_dict(input_csv, input_file_has_header='Auto',
                       object_id_column=0, header_keys=None):
    header_keys = list(header_keys or [])  # avoid a shared mutable default
    with open(input_csv, newline='') as object_file:
        if input_file_has_header == 'Auto':
            # Sniff a sample, but only trust the result for 3 or more rows.
            sample = object_file.read(2048)
            row_count = sum(1 for _ in csv.reader(sample.splitlines()))
            input_file_has_header = (row_count >= 3
                                     and csv.Sniffer().has_header(sample))
            object_file.seek(0)  # rewind so the reader starts at the first row
        object_reader = csv.reader(object_file)
        if input_file_has_header:
            header_keys = next(object_reader, [])  # consume the header row
            assert header_keys != [], (
                "File %s appears to have a header row, but there was a problem "
                "parsing it because header_keys remains empty" % input_csv)
        object_dict = {}
        for row in object_reader:
            # Fall back to column indices when no header keys are known.
            keys = header_keys or range(len(row))
            # Key each record by the value in the ID column.
            object_dict[row[object_id_column]] = dict(zip(keys, row))
    return object_dict
Answer 0 (score: 3)
When in doubt, go to the source:
def has_header(self, sample):
    # Creates a dictionary of types of data in each column. If any
    # column is of a single type (say, integers), *except* for the first
    # row, then the first row is presumed to be labels. If the type
    # can't be determined, it is assumed to be a string in which case
    # the length of the string is the determining factor: if all of the
    # rows except for the first are the same length, it's a header.
    # Finally, a 'vote' is taken at the end for each column, adding or
    # subtracting from the likelihood of the first row being a header.
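A quick skim through the method shows that it does not try to enforce a minimum number of non-header rows, so, by the rules above, it will work on a file with only two rows.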
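As a rough check of that reading, the sketch below feeds three two-row samples to csv.Sniffer().has_header(). The wrapper name looks_like_header and the sample strings are made up for illustration; only the csv calls come from the standard library.

```python
import csv


def looks_like_header(sample):
    """Hypothetical thin wrapper: return csv.Sniffer's verdict for a text sample."""
    return csv.Sniffer().has_header(sample)


# Each sample has exactly two rows (illustrative data, not from the question).
with_header = "id,name\n1,alice\n"          # real header over one data row
numeric_only = "1,2\n3,4\n"                 # no header, all values numeric
strings_only = "bob,paris\nalice,london\n"  # no header, all values strings

for label, sample in [("with_header", with_header),
                      ("numeric_only", numeric_only),
                      ("strings_only", strings_only)]:
    print(label, looks_like_header(sample))
```

By the voting rules quoted above, the first sample should be reported as having a header and the second as not. The third has no real header, yet every value in its first row differs in length from the value below it, so the same rules vote for a header. Two rows are therefore enough for the method to run, but with so little data the result is only a heuristic guess, which is a reason to keep an explicit override such as input_file_has_header. Note also that has_header() calls sniff() on the sample first, and sniff() raises csv.Error if it cannot determine a delimiter.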