Reading a custom text format in pandas

Time: 2018-07-24 08:02:41

Tags: python python-3.x pandas optimization io

I have a 2 GB text file in the following format.

LABEL NO.          =      100001
   COL1           COL2           COL3           COL4           COL5           COL6           COL7           ID
  -1.194298E+07  -8.277112E+07  -3.654541E+07   7.397563E+06   2.007487E+07  -1.730773E+07   3.122298E+02  1.0000000E+00
  -1.196239E+07  -8.661736E+07  -3.674206E+07   7.661088E+06   2.119962E+07  -1.719316E+07   3.122298E+02  2.0000000E+00
  -1.862518E+07  -1.370518E+08  -5.674587E+07   6.354599E+06   2.785788E+07  -2.635757E+07   3.511743E+02  3.0000000E+00
  -1.870298E+07  -1.385814E+08  -5.688693E+07   6.273323E+06   2.788952E+07  -2.641291E+07   3.511743E+02  4.0000000E+00
  -1.870021E+07  -1.385812E+08  -5.687748E+07   6.270844E+06   2.788576E+07  -2.640796E+07   3.511743E+02  5.0000000E+00
  -1.917867E+07  -1.366550E+08  -5.872183E+07   6.969717E+06   2.885888E+07  -2.735340E+07   3.511743E+02  6.0000000E+00
  -1.891841E+07  -1.313277E+08  -5.767392E+07   6.362409E+06   2.700424E+07  -2.708990E+07   3.511743E+02  7.0000000E+00
.....................(similar rows repeating)
LABEL NO.          =      100002
   COL1           COL2           COL3           COL4           COL5           COL6           COL7           ID
  -1.642765E+07  -9.443663E+07  -3.835620E+07   1.219941E+07   2.479202E+07  -2.056075E+07   3.115766E+02  1.0000000E+00
  -1.655851E+07  -9.891013E+07  -3.871946E+07   1.264886E+07   2.604418E+07  -2.052297E+07   3.115766E+02  2.0000000E+00
  -2.561388E+07  -1.552053E+08  -5.951435E+07   1.287625E+07   3.402213E+07  -3.122215E+07   3.520203E+02  3.0000000E+00
  -2.569815E+07  -1.566586E+08  -5.962675E+07   1.283599E+07   3.409514E+07  -3.126740E+07   3.520203E+02  4.0000000E+00
  -2.569427E+07  -1.566549E+08  -5.961668E+07   1.283294E+07   3.409080E+07  -3.126145E+07   3.520203E+02  5.0000000E+00
  -2.641499E+07  -1.559677E+08  -6.166505E+07   1.359925E+07   3.513394E+07  -3.244093E+07   3.520203E+02  6.0000000E+00
  -2.592857E+07  -1.495959E+08  -6.038651E+07   1.267992E+07   3.303278E+07  -3.201815E+07   3.520203E+02  7.0000000E+00
.....................(similar rows repeating)
.................... (similar sections repeating)

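The number of repeating sections is much larger than the number of rows within a section: there are up to a million sections, but no more than a few dozen rows in each.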

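What is the best (fastest) way to read this into a pandas DataFrame? I want the label at the top of each section and the ID column to become the index of the DataFrame.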
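To make the target layout concrete, here is a minimal sketch of the two-level index described above. It uses three rows and only the first two data columns copied from the sample; the variable names `rows`, `target` and `section` are purely illustrative, and the lowercase column names match what the reading code further down produces.

```python
import pandas as pd

# Three rows copied from the sample above (label, id, COL1, COL2 only).
rows = [
    (100001, 1.0, -1.194298e07, -8.277112e07),
    (100001, 2.0, -1.196239e07, -8.661736e07),
    (100002, 1.0, -1.642765e07, -9.443663e07),
]

# The section label and the row id together form a two-level index.
target = pd.DataFrame(rows, columns=["Label", "id", "col1", "col2"])
target = target.set_index(["Label", "id"])

# Selecting by label returns every row of that section, indexed by id.
section = target.loc[100001]
```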

I am currently reading it with the following code, but it takes about 2 minutes on my machine.

import pandas as pd

def read_data_file(data_file):
    data_matrix = []
    label_number = None
    with open(data_file, "r") as contents:
        for line in contents:
            stripped_line = line.lower().strip()
            if "label no." in stripped_line:
                # The integer after " = " is the label of the current section
                label_number = int(stripped_line[stripped_line.find(" = "):][3:])
            else:
                line_items = stripped_line.split()
                if not data_matrix and label_number is not None:
                    data_matrix = [["Label"] + line_items]  # Setting the column names
                    # Number of columns is variable, but is constant in one file
                elif label_number is not None:
                    try:
                        line_as_floats = map(float, line_items)
                        row = [label_number] + list(line_as_floats)
                        data_matrix.append(row)
                    except ValueError:
                        # Repeated header lines are not numeric, so they are skipped
                        pass
    df = pd.DataFrame(data_matrix[1:], columns=data_matrix[0])
    df.set_index(["Label", "id"], inplace=True)
    return df

Is there any way to optimize the read_data_file function?

Thanks!
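One direction that may be worth timing, offered as a minimal sketch rather than a measured answer: keep the Python loop as cheap as possible (only classify each line), and hand all numeric lines to `pd.read_csv` in a single call so that the float conversion happens in pandas' C parser instead of per-row `map(float, ...)`. The function name `read_data_file_fast` is hypothetical. The sketch assumes that every section starts with a line beginning with `LABEL NO.`, that the header line directly follows it, and that data lines start with a digit or a sign, as in the sample above.

```python
import io

import numpy as np
import pandas as pd


def read_data_file_fast(data_file):
    """Hypothetical variant of read_data_file; assumes the layout shown in the sample."""
    labels = []        # one label number per section
    counts = []        # number of data rows found in each section
    data_lines = []    # raw numeric lines, parsed later in a single call
    columns = None     # column names, taken from the first header line
    expect_header = False

    with open(data_file, "r") as contents:
        for line in contents:
            if line.startswith("LABEL NO."):
                labels.append(int(line.split("=")[1]))
                counts.append(0)
                expect_header = True
                continue
            stripped = line.strip()
            if not labels or not stripped:
                continue  # skip blank lines and anything before the first label
            if expect_header:
                # Assumption: the header is the first non-blank line after a label.
                expect_header = False
                if columns is None:
                    columns = stripped.lower().split()
                continue
            # Assumption: data lines start with a digit or a sign; others are skipped.
            if stripped[0].isdigit() or stripped[0] in "+-":
                data_lines.append(stripped)
                counts[-1] += 1

    # Parse every numeric line at once with the whitespace-delimited C parser.
    df = pd.read_csv(
        io.StringIO("\n".join(data_lines)), sep=r"\s+", header=None, names=columns
    )
    # Repeat each section label once per data row of that section.
    df.insert(0, "Label", np.repeat(labels, counts))
    return df.set_index(["Label", "id"])
```

The column names are lowercased so that the resulting index (`Label`, `id`) matches what `read_data_file` produces. Note that joining all data lines into one string temporarily holds a second copy of the file contents in memory, which matters for a 2 GB input; whether this is actually faster should be checked on the real file.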

0 answers:

No answers