Question

I have a large csv that I load as follows:

df=pd.read_csv('my_data.tsv',sep='\t',header=0, skiprows=[1,2,3])

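I run into several errors while loading it.

First, if I don't specify warn_bad_lines=True,error_bad_lines=False, I get:

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

Second, if I use the options above, I now get:

CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585

The question is: how can I look at these bad lines to understand what is going on? Is it possible to have read_csv return these bogus lines?

I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):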

from pandas import parser

try:
  df=pd.read_csv('mydata.tsv',sep='\t',header=0, skiprows=[1,2,3])
except (parser.CParserError) as detail:
  print  detail

but I still get

Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24

Answer 1

In my case, adding the separator helped:

data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')

Answer 2

We can get the line number from the error message and print that line to see what it looks like.
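
The idea of pulling the line number out of the error message and printing that line can be sketched as follows. This is a minimal sketch, not code from the answer: the helper names (line_number_from_error, read_raw_line, show_bad_line) are hypothetical, and it assumes a pandas version that exposes pandas.errors.ParserError and an error message containing "line N". It also assumes the reported number is a 1-based physical line, which may not hold when rows are skipped or quoted fields span several lines, so it can be worth looking at the neighbouring lines too.

```python
import re
from itertools import islice


def line_number_from_error(message):
    """Return the line number mentioned in a parser error message, or None."""
    match = re.search(r"line (\d+)", message)
    return int(match.group(1)) if match else None


def read_raw_line(path, line_number, encoding="utf-8"):
    """Return the raw text of a 1-based physical line, or None if the file is shorter."""
    with open(path, encoding=encoding, newline="") as f_obj:
        return next(islice(f_obj, line_number - 1, line_number), None)


def show_bad_line(path, **read_csv_kwargs):
    """Try to parse `path`; on a parser error, print the line the error names."""
    import pandas as pd  # imported here so the helpers above need only the stdlib

    try:
        return pd.read_csv(path, **read_csv_kwargs)
    except pd.errors.ParserError as err:
        number = line_number_from_error(str(err))
        if number is not None:
            # repr() makes stray tabs and quote characters visible
            print(number, repr(read_raw_line(path, number)))
        raise
```

A call such as show_bad_line('my_data.tsv', sep='\t', header=0) would then print the offending line before re-raising the error.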

Answer 3

I will give my answer in two parts:

Part 1: the OP asked how to output these bad lines. To answer that, we can use the Python csv module in some simple code, like this:

import csv

file = 'your_filename.csv'  # use your filename
lines_set = set([100, 200])  # use your bad line numbers here
last_line = max(lines_set)  # stop reading once we are past this line

with open(file, newline='') as f_obj:
    # line_number counts from 0, as enumerate does by default
    for line_number, row in enumerate(csv.reader(f_obj)):
        if line_number > last_line:
            break
        elif line_number in lines_set:
            print(line_number, row)

We can also put it into a more general function, for example:

import csv


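def read_my_lines(file, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    last_line = max(lines_set)  # stop reading once we are past this line
    with open(file, newline='') as f_obj:
        # use the reader passed in by the caller, not always csv.reader
        for line_number, row in enumerate(reader(f_obj)):
            if line_number > last_line:
                break
            elif line_number in lines_set:
                print(line_number, row)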


if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])

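Part 2: the cause of the error:

It is hard to diagnose a problem like this without a sample of the file you are using. But you should try this:

pd.read_csv(filename)

Does it parse the file without errors? If so, I will explain why.

The number of columns is inferred from the first line.

By using skiprows and header=0, you skipped the first 3 rows, which I guess contain the column names or the header that should carry the correct number of columns.

Basically, you are constraining what the parser does.

So parse without skiprows (or with only header=0), then reindex to what you need afterwards.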
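
Since the expected column count comes from the first line, one way to check whether the file really is consistent is to count the fields of every record with the csv module. The sketch below is only an illustration: field_count_report is a hypothetical helper, and the tab delimiter and UTF-8 encoding are assumptions based on the question's my_data.tsv. Record numbers are counted from 1 and follow the csv module's notion of a record, so they can differ from physical line numbers when quoted fields contain line breaks.

```python
import csv
from collections import Counter


def field_count_report(path, delimiter="\t", encoding="utf-8"):
    """Count fields per record and list the records that differ from the first one."""
    counts = Counter()
    odd_records = []
    expected = None
    with open(path, encoding=encoding, newline="") as f_obj:
        reader = csv.reader(f_obj, delimiter=delimiter)
        for record_number, row in enumerate(reader, start=1):
            if expected is None:
                expected = len(row)  # the first record sets the expectation
            counts[len(row)] += 1
            if len(row) != expected:
                odd_records.append((record_number, len(row)))
    return expected, counts, odd_records
```

If odd_records comes back empty, the mismatch is more likely caused by the skiprows/header combination or by quoting than by rows with extra fields.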

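Notes:

If you are not sure which delimiter the file uses, pass sep=None, but parsing will be slower.

From the pandas.read_csv documentation:

sep : str, default ','. Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'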
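
The csv.Sniffer tool mentioned in the documentation can also be called directly, which gives a quick way to check the delimiter before handing the file to pandas. This is a rough sketch under assumptions: sniff_delimiter is a hypothetical helper, and the candidate delimiters, sample size and encoding are illustrative defaults, not values taken from pandas.

```python
import csv


def sniff_delimiter(path, candidates=",;\t|", sample_size=64 * 1024, encoding="utf-8"):
    """Guess the delimiter from the first `sample_size` characters of the file."""
    with open(path, encoding=encoding, newline="") as f_obj:
        sample = f_obj.read(sample_size)
    # csv.Sniffer raises csv.Error when it cannot decide on a delimiter
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

The result can then be passed explicitly, for example as sep=sniff_delimiter('my_data.tsv'), which keeps the faster C engine available.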


Error tokenizing data during Pandas read_csv. How to actually see the bad lines?
