Question

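I have some huge CSV files that I need to validate; I need to make sure they are all delimited by the backtick character (`). I have a reader that opens each file and prints its contents. I'm wondering what different approaches you would take to verify that every value is delimited by the backtick character.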

import csv

for csvfile in self.fullcsvpathfiles:
    # print("file..")
    # open the current file (the original always opened fullcsvpathfiles[0])
    with open(csvfile, mode='r', newline='') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter="`")
        for row in csv_reader:
            print(row)

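I'm not sure how to verify that every value is delimited by a backtick and raise an error otherwise. The tables are huge (not that that's a problem for electricity ;))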

Answer 1

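Method 1

With the pandas library, you can read the CSV file using pandas.read_csv() with sep='`' (which specifies the delimiter). If it parses the file into a well-formed DataFrame, you can be almost certain the delimiter is right.

In addition, to automate the validation, you can check whether the number of NaN values in the DataFrame is within an acceptable level. Assuming your CSV files do not contain many blanks (so only a few NaN values are expected), you can compare the NaN count against a threshold you set.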

import pandas as pd

nan_threshold = 20
for csvfile in self.fullcsvpathfiles:
    # if it fails at this step, then something (probably the delimiter) must be wrong
    my_df = pd.read_csv(csvfile, sep="`")
    nans = my_df.isna().sum().sum()  # total NaN count across all columns
    if nans > nan_threshold:
        print(csvfile)  # make some warning here

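For more information about pandas.read_csv(), see the pandas documentation.

Method 2

As mentioned in the comments, you can also check whether the delimiter occurs the same number of times on every line of the file.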

num_of_sep = -1  # initial value
# assume you are at the step of reading a file f
for line in f:
    num = line.count("`")
    if num_of_sep == -1:
        num_of_sep = num
    elif num != num_of_sep:
        print('Some warning here')
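
As a slightly fuller sketch of the same idea, the check can collect the offending line numbers instead of only printing a warning. The helper name find_inconsistent_lines and its return format are assumptions made for illustration; they are not part of the original answer. Note that counting raw characters also counts a backtick that sits inside a quoted value.

```python
import io


def find_inconsistent_lines(lines, sep="`"):
    """Return (line_number, count) pairs for lines whose separator count
    differs from that of the first line. Hypothetical helper."""
    expected = None
    bad = []
    for lineno, line in enumerate(lines, start=1):
        count = line.count(sep)
        if expected is None:
            expected = count  # the first line sets the expected count
        elif count != expected:
            bad.append((lineno, count))
    return bad


# Any iterable of lines works, including an open file object.
sample = io.StringIO("a`b`c\n1`2`3\n4`5\n6`7`8`9\n")
print(find_inconsistent_lines(sample))  # [(3, 1), (4, 3)]
```

Because it iterates line by line, this should stay cheap on memory even for very large files.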

Answer 2

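If you don't know how many columns the file has, you can check that all rows have the same number of columns. If you expect the header (first row) to always be correct, use it to determine the column count.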

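for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r', newline='') as csv_file:
        # csv.reader yields lists, so len(row) is the real field count
        # (DictReader pads short rows with None, which would hide them)
        csv_reader = csv.reader(csv_file, delimiter="`")
        ncols = len(next(csv_reader))  # header row sets the expected count
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something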

for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r') as f:
        row = next(f)
        nseps = row.count('`')  # delimiter count of the header line
        if not all(row.count('`') == nseps for row in f):
            pass  # do something

If you know how many columns the file has...

for csvfile in self.fullcsvpathfiles:
    with open(csvfile, mode='r', newline='') as csv_file:
        # figure out how many columns it is supposed to have here?
        ncols = special_process()
        # csv.reader yields lists, so len(row) is the real field count
        csv_reader = csv.reader(csv_file, delimiter="`")
        if not all(len(row) == ncols for row in csv_reader):
            pass  # do something

for csvfile in self.fullcsvpathfiles:
    # figure out how many columns it is supposed to have here?
    ncols = special_process()
    with open(csvfile, mode='r') as f:
        # a row with ncols columns contains ncols - 1 delimiters
        if not all(row.count('`') == ncols - 1 for row in f):
            pass  # do something
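
The two cases above (column count known or taken from the header) can be folded into one function that raises an error naming the first bad line. This is only a sketch: the name validate_column_count and its parameters are assumptions for illustration, and it treats a blank line as a row with zero fields.

```python
import csv
import io


def validate_column_count(csv_file, delimiter="`", ncols=None):
    """Raise ValueError at the first row whose field count is wrong.

    If ncols is None, the first row (the header) sets the expected count.
    Returns the expected count. Hypothetical helper.
    """
    reader = csv.reader(csv_file, delimiter=delimiter)
    for row in reader:
        if ncols is None:
            ncols = len(row)
        elif len(row) != ncols:
            raise ValueError(
                f"line {reader.line_num}: expected {ncols} fields, got {len(row)}"
            )
    return ncols


good = io.StringIO("id`name\n1`x\n2`y\n")
print(validate_column_count(good))  # 2

bad = io.StringIO("id`name\n1`x`extra\n")
try:
    validate_column_count(bad)
except ValueError as exc:
    print(exc)  # line 2: expected 2 fields, got 3
```

One caveat: a file that uses a different delimiter entirely (say, commas) is perfectly consistent when split on backticks, because every row parses as a single field. Passing ncols explicitly, or checking that the returned count is greater than 1, guards against that case.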

Answer 3

If you know the expected number of elements, you can check each line:

with open(filename, 'r') as f:
    for line in f:
        line = line.rstrip("\n").split("`")
        # compare the number of fields, not the list itself
        if len(line) != numElements:
            raise Exception("Bad file")

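If you know which delimiter was inserted by mistake, you can also try to recover instead of raising an exception. Maybe something like this: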

line="`".join(line).replace(wrongDelimiter,"`").split("`")

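Of course, once you are reading through the file yourself like this, you don't need an external library to read the data. Just carry on and use it.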
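
The recovery idea above could be sketched as a small function that first tries a normal split and only falls back to replacing the stray delimiter when the field count is wrong. The name split_with_recovery and the default stray delimiter ";" are assumptions chosen for illustration.

```python
def split_with_recovery(line, num_elements, wrong_delimiter=";"):
    """Split a line on backticks; if the field count is wrong, retry after
    treating wrong_delimiter as a backtick. Hypothetical helper."""
    text = line.rstrip("\n")
    fields = text.split("`")
    if len(fields) == num_elements:
        return fields
    # Retry, assuming the stray delimiter was meant to be a backtick
    fields = text.replace(wrong_delimiter, "`").split("`")
    if len(fields) != num_elements:
        raise ValueError(f"Bad line: {line!r}")
    return fields


print(split_with_recovery("a`b;c\n", 3))  # ['a', 'b', 'c']
```

The fallback only runs on lines that already fail the count check, so a ";" that legitimately appears inside a value on a well-formed line is left alone.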

How to verify that a CSV file is delimited by a certain character (in this case, a backtick (`))
