I have to load the information from a large txt file into a pandas dataframe. The text file is formatted like this (and I cannot change it in any way):
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_0 S_1 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_0 S_2 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_1 S_3 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
The original file has more than 65K lines.
I want to create a single dataframe containing the information from the file, including the values in the first line after each separator. I wrote working code:
import os
import pandas as pd

my_path = r"C:\Users\212744206\Desktop\COSO"
my_file = os.path.join(my_path, 'my_file.dat')

istart = False
with open(my_file) as fp:
    for i, line in enumerate(fp):
        if (line[0] != 'o'):
            if line.split()[0][0] == 'Z':
                iZ = int((line.split()[0]).split('_')[1])
                iS = int((line.split()[1]).split('_')[1])
                iC = int((line.split()[2]).split('_')[1])
            elif (line.split()[0] == 'X[m]') or (len(line.split()) == 2) or (len(line.split()) == 4):
                continue
            else:
                dfline = pd.DataFrame(line.split())
                dfline = dfline.transpose()
                dfline.insert(0, column='C', value=iC)
                dfline.insert(0, column='S', value=iS)
                dfline.insert(0, column='Z', value=iZ)
                if istart == False:
                    df_zone = dfline.copy()
                    istart = True
                else:
                    df_zone = df_zone.append(dfline, ignore_index=True, sort=False)

print(df_zone)
...but it is very slow for my application (the print at the end is obviously there for debugging and I won't use it with the big file). How can I write it in a more "pythonic" and efficient way? All suggestions are welcome! Thanks
EDIT: Unfortunately, my "useful" data can have 3, 4, 5 or any number of rows... Moreover, I need to parse the "Z_0 S_1 C_1" line, because I need an output like this:
Z S C 0 1 2 3 4
0 0 1 1 -2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
1 0 1 1 -7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
2 0 1 1 -2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
3 0 2 1 -2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
4 0 2 1 -7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
5 0 2 1 -2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
6 1 3 1 -2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
7 1 3 1 -7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
8 1 3 1 -2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
Answer 0 (score: 1)
Don't append dataframes. That is a very slow operation. Ideally, I would make two passes: one pass over the file to count the rows, then rewind the file, create a dataframe of the appropriate size, and fill it via direct indexing on the second pass.
As a micro-optimization, note that you call line.split() several times; it should be cached.
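A minimal sketch of that two-pass idea (my own illustration, not part of the original answer; it assumes, as in the example file, that the data rows are the only 5-token lines apart from the "X[m] ..." header):

import numpy as np
import pandas as pd

my_file = "my_file.dat"  # file in the format shown in the question

def is_data_row(parts):
    """A data row has 5 tokens and is not the 'X[m] ...' header."""
    return len(parts) == 5 and parts[0] != 'X[m]'

# First pass: count the data rows so the buffer can be
# allocated once, at the right size.
n_rows = 0
with open(my_file) as fp:
    for line in fp:
        if line[0] != 'o' and is_data_row(line.split()):
            n_rows += 1

# Second pass: fill the preallocated array by direct indexing.
out = np.empty((n_rows, 8))
row = 0
iZ = iS = iC = 0
with open(my_file) as fp:
    for line in fp:
        if line[0] == 'o':
            continue
        parts = line.split()  # split once and reuse (the cached split)
        if not parts:
            continue
        if parts[0][0] == 'Z':
            # "Z_0 S_1 C_1" -> 0, 1, 1
            iZ, iS, iC = (int(p.split('_')[1]) for p in parts)
        elif is_data_row(parts):
            out[row, :3] = (iZ, iS, iC)
            out[row, 3:] = [float(p) for p in parts]
            row += 1

# Note: Z, S, C end up as floats here, since everything shares one array
df_zone = pd.DataFrame(out, columns=['Z', 'S', 'C', 0, 1, 2, 3, 4])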
Answer 1 (score: 1)
The main performance bottleneck is the appending to the dataframe. Instead, you can create a data buffer and expand it whenever it overflows. The code below generates a synthetic dataset of about 100,000 rows of data and then parses the corresponding data file:
import pandas as pd
import numpy as np
from itertools import combinations_with_replacement
from scipy.special import comb  # scipy.misc.comb was removed in SciPy 1.0
from time import time

np.random.seed(0)

# Array buffer increment size
array_size = 1000

# Data file (output and input)
filename = "stack_output.dat"


def generate_data(m):
    """Generate synthetic (dummy) data to test performance"""
    # Weird string appearing in the example data
    sep_string = "".join(["o--"]*26)
    sep_string += "o\n"
    # Generate ZSC data, which seem to be combinatoric in nature
    x = np.arange(m)
    Ngroups = comb(m, 3, exact=True, repetition=True)
    # For each group of ZSC, generate a random number of lines of data
    # (between 2 and 7 lines)
    Nrows = np.random.randint(low=2, high=8, size=Ngroups)
    # Open file and write data
    with open(filename, "w") as f:
        # Loop over all values of ZSC (000, 001, 010, 011, etc.)
        for n, ZSC in enumerate(combinations_with_replacement(x, 3)):
            # Generate random data
            rand_data = np.random.rand(Nrows[n], 5)
            # Write (meta) data to file
            f.write(sep_string)
            f.write("Z_%d S_%d C_%d\n" % ZSC)
            f.write("foo bar\n")
            f.write("X[m] Y[m] Z[m] alpha[-] beta[-]\n")
            for data in rand_data:
                f.write("%.8e %.8e %.8e %.8e %.8e\n" % tuple(data))
    return True


def grow_array(x):
    """Helper function to expand an array"""
    buf = np.zeros((array_size, x.shape[1])) * np.nan
    return np.vstack([x, buf])


def parse_data():
    """Parse the data using a growing buffer"""
    # Number of lines of meta data (i.e. lines that don't
    # contain the XYZ alpha beta values)
    Nmeta = 3
    # Some counters
    Ndata = 0
    group_index = 0
    # Data buffer
    all_data = np.zeros((array_size, 8)) * np.nan
    # Read filename
    with open(filename, "r") as f:
        # Iterate over all lines
        for i, line in enumerate(f):
            # If we're at that weird separating line, we know we're at the
            # start of a new group of data, defined by Z, S, C
            if line[0] == "o":
                group_index = i
            # If we're one line below the separator, get the Z, S, C values
            elif i - group_index == 1:
                ZSC = line.split()
                # Extract the number from the string
                Z = ZSC[0][2:]
                S = ZSC[1][2:]
                C = ZSC[2][2:]
                ZSC_clean = np.array([Z, S, C])
            # If we're in a line below the meta data, extract the XYZ values
            elif i - group_index > Nmeta:
                # Split the numbers in the line
                data = np.array(line.split(), dtype=float)
                # Check if the data still fits in the buffer.
                # If not: expand the buffer
                if Ndata == len(all_data) - 1:
                    all_data = grow_array(all_data)
                # Populate the buffer
                all_data[Ndata] = np.hstack([ZSC_clean, data])
                Ndata += 1
    # Convert the buffer to a pandas dataframe (and clip the unpopulated
    # bits of the buffer, which are still NaN). Note that two columns end
    # up named "Z": the zone index and the Z[m] coordinate
    df = pd.DataFrame(all_data, columns=("Z", "S", "C", "X", "Y", "Z", "alpha", "beta")).dropna(how="all")
    return df


t0 = time()
generate_data(50)
t1 = time()
data = parse_data()
t2 = time()
print("Data size: \t\t\t %i" % len(data))
print("Rendering data: \t %.3e s" % (t1 - t0))
print("Parsing data: \t\t %.3e s" % (t2 - t1))
The result:
Data size: 99627
Rendering data: 3.360e-01 s
Parsing data: 1.356e+00 s
Is this fast enough for your needs?
Previous answer (kept for reference; it assumed the data file had a fixed structure)
You can use the skiprows feature of pandas.read_csv. In your example, only the last 3 lines of every block of 9 contain useful data, so you can pass skiprows a function that returns True (skip) whenever the 0-based line index modulo 9 is less than 6, keeping indices 6, 7 and 8 of each block:
import pandas as pd

filename = "data.dat"
data = pd.read_csv(
    filename, names=("X", "Y", "Z", "alpha", "beta"), delim_whitespace=True,
    skiprows=lambda x: x % 9 < 6,
)
print(data)
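Under that same fixed-structure assumption, the Z, S, C columns could hypothetically be recovered with a second read_csv that keeps only the "Z_0 S_1 C_1" lines (index 1 of each 9-line block) and repeats each one over its 3 data rows. This extension is my own sketch, not part of the original answer:

import pandas as pd

filename = "data.dat"

# Data rows: indices 6, 7, 8 of each 9-line block (as above)
data = pd.read_csv(
    filename, names=("X", "Y", "Z", "alpha", "beta"), delim_whitespace=True,
    skiprows=lambda x: x % 9 < 6,
)

# Metadata rows: the "Z_0 S_1 C_1" line is index 1 of each block.
# "Zi" avoids a clash with the "Z" coordinate column above
meta = pd.read_csv(
    filename, names=("Zi", "S", "C"), delim_whitespace=True,
    skiprows=lambda x: x % 9 != 1,
)

# Strip the "Z_"/"S_"/"C_" prefixes, then repeat each metadata
# row once per data row of its block (3 rows per block here)
for col in meta.columns:
    meta[col] = meta[col].str.split("_").str[1].astype(int)
meta = meta.loc[meta.index.repeat(3)].reset_index(drop=True)

print(pd.concat([meta, data], axis=1))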