I have a 2 GB text file in the following format.
LABEL NO. = 100001
COL1 COL2 COL3 COL4 COL5 COL6 COL7 ID
-1.194298E+07 -8.277112E+07 -3.654541E+07 7.397563E+06 2.007487E+07 -1.730773E+07 3.122298E+02 1.0000000E+00
-1.196239E+07 -8.661736E+07 -3.674206E+07 7.661088E+06 2.119962E+07 -1.719316E+07 3.122298E+02 2.0000000E+00
-1.862518E+07 -1.370518E+08 -5.674587E+07 6.354599E+06 2.785788E+07 -2.635757E+07 3.511743E+02 3.0000000E+00
-1.870298E+07 -1.385814E+08 -5.688693E+07 6.273323E+06 2.788952E+07 -2.641291E+07 3.511743E+02 4.0000000E+00
-1.870021E+07 -1.385812E+08 -5.687748E+07 6.270844E+06 2.788576E+07 -2.640796E+07 3.511743E+02 5.0000000E+00
-1.917867E+07 -1.366550E+08 -5.872183E+07 6.969717E+06 2.885888E+07 -2.735340E+07 3.511743E+02 6.0000000E+00
-1.891841E+07 -1.313277E+08 -5.767392E+07 6.362409E+06 2.700424E+07 -2.708990E+07 3.511743E+02 7.0000000E+00
.....................(similar rows repeating)
LABEL NO. = 100002
COL1 COL2 COL3 COL4 COL5 COL6 COL7 ID
-1.642765E+07 -9.443663E+07 -3.835620E+07 1.219941E+07 2.479202E+07 -2.056075E+07 3.115766E+02 1.0000000E+00
-1.655851E+07 -9.891013E+07 -3.871946E+07 1.264886E+07 2.604418E+07 -2.052297E+07 3.115766E+02 2.0000000E+00
-2.561388E+07 -1.552053E+08 -5.951435E+07 1.287625E+07 3.402213E+07 -3.122215E+07 3.520203E+02 3.0000000E+00
-2.569815E+07 -1.566586E+08 -5.962675E+07 1.283599E+07 3.409514E+07 -3.126740E+07 3.520203E+02 4.0000000E+00
-2.569427E+07 -1.566549E+08 -5.961668E+07 1.283294E+07 3.409080E+07 -3.126145E+07 3.520203E+02 5.0000000E+00
-2.641499E+07 -1.559677E+08 -6.166505E+07 1.359925E+07 3.513394E+07 -3.244093E+07 3.520203E+02 6.0000000E+00
-2.592857E+07 -1.495959E+08 -6.038651E+07 1.267992E+07 3.303278E+07 -3.201815E+07 3.520203E+02 7.0000000E+00
.....................(similar rows repeating)
.................... (similar sections repeating)
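There are far more sections than rows per section: up to a million sections, each with at most a few dozen rows.
What is the best (fastest) way to read this into a pandas DataFrame? I would like the label at the top of each section and the ID column to become the index of the DataFrame.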
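To make the intended index concrete, here is a minimal sketch that uses only the standard library on a shortened, made-up sample (two value columns instead of seven). The name `iter_records` and the sample text are hypothetical, chosen just for illustration. It assumes the ID is always the last column, as in the header shown above. The `(label, id)` tuples it produces are what I would like to end up as the two levels of the DataFrame index.

```python
import io

# Shortened, made-up sample in the same layout as the real file.
SAMPLE = """\
LABEL NO. = 100001
COL1 COL2 ID
-1.194298E+07 3.122298E+02 1.0000000E+00
-1.196239E+07 3.122298E+02 2.0000000E+00
LABEL NO. = 100002
COL1 COL2 ID
-1.642765E+07 3.115766E+02 1.0000000E+00
"""


def iter_records(lines):
    """Yield ((label, id), values) for every data row in the sectioned format."""
    label = None
    for line in lines:
        if line.startswith("LABEL NO."):
            # A new section starts: remember its label number.
            label = int(line.split("=")[1])
        elif label is not None:
            fields = line.split()
            if not fields:
                continue  # Blank line.
            try:
                values = [float(field) for field in fields]
            except ValueError:
                continue  # Column-header line or other non-numeric line.
            # Assumption: the last column is the row ID, stored as a float.
            yield (label, int(values[-1])), values[:-1]


records = dict(iter_records(io.StringIO(SAMPLE)))
print(list(records))  # [(100001, 1), (100001, 2), (100002, 1)]
```

Each key is one (label, ID) pair, and each value holds the remaining columns of that row.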
I am currently reading the file with the following code, but it takes about 2 minutes on my machine.
import pandas as pd


def read_data_file(data_file):
    data_matrix = []
    label_number = None
    with open(data_file, "r") as contents:
        for line in contents:
            stripped_line = line.lower().strip()
            if "label no." in stripped_line:
                # Start of a new section: parse the number after " = "
                label_number = int(stripped_line[stripped_line.find(" = "):][3:])
            else:
                line_items = stripped_line.split()
                if data_matrix == [] and label_number is not None:
                    data_matrix = [["Label"] + line_items]  # Setting the column names
                    # Number of columns is variable, but is constant in one file
                elif label_number is not None:
                    try:
                        line_as_floats = map(float, line_items)
                        row = [label_number] + list(line_as_floats)
                        data_matrix.append(row)
                    except ValueError:
                        pass  # Skip repeated header lines and other non-numeric lines
    df = pd.DataFrame(data_matrix[1:], columns=data_matrix[0])
    # Column names were lowercased above, so the ID column is "id"
    df.set_index(["Label", "id"], inplace=True)
    return df
Is there any way to optimize the read_data_file function?
Thanks!