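We have been trying to find a way to use Python to parse a tricky text file produced by a PEST analysis. It lists the measurements of 63 different variables for more than 30,000 observations. Here is a sample of the output (showing only the first few of the >30,000 observations):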
cmfa cmfb cmfc cmfd cmla cmlb cmlc cmld
cmle cgfa cgfb cgfc cgfd cgfe dgfa dgfb
dgfc dgfd icfa icfb icfc icfd vawa vawb
vawc vawd vawe vawf vswa vswb vswc vswd
vswe chfa chfb chfc chfd chfe cgwa cgwb
cgwc cgwd cgwe crta crtb crtc crtd crte
icha ichb ichc ichd iche csea cseb csec
csed csee csef caqa caqb crsa crsb
0 -1.900000E-03 1.080000E-02 3.150000E-02 0.00000 0.00000 0.00000 0.00000 -3.020000E-02
0.00000 -1.870000E-02 0.00000 4.600000E-03 0.00000 0.00000 0.00000 4.510000E-02
0.00000 0.00000 3.650000E-02 -7.000000E-03 -2.100000E-03 -2.000000E-04 3.200000E-03 8.000000E-03
-7.000000E-04 -1.500000E-02 0.00000 4.800000E-03 1.900000E-03 4.000000E-04 2.500000E-03 2.500000E-03
-1.400000E-02 0.00000 0.00000 0.00000 0.00000 0.00000 -3.200000E-03 -8.060000E-02
-0.126500 0.298400 0.00000 0.00000 0.00000 0.00000 0.00000 8.000000E-04
-1.900000E-03 1.400000E-03 0.00000 0.00000 -3.200000E-03 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 -1.200000E-02 1.930000E-02
1 -1.800000E-03 1.140000E-02 1.850000E-02 0.00000 0.00000 0.00000 0.00000 -2.600000E-02
0.00000 -8.200000E-03 0.00000 1.200000E-03 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 2.560000E-02 -6.100000E-03 -1.100000E-03 0.00000 3.000000E-03 7.400000E-03
-7.000000E-04 -1.410000E-02 0.00000 5.000000E-03 1.900000E-03 3.000000E-04 2.300000E-03 2.300000E-03
-1.330000E-02 0.00000 0.00000 0.00000 0.00000 0.00000 -3.400000E-03 -8.410000E-02
-0.123500 0.301900 0.00000 0.00000 0.00000 0.00000 0.00000 1.200000E-03
-2.000000E-03 1.400000E-03 0.00000 0.00000 -3.200000E-03 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 -1.280000E-02 2.050000E-02
2 -3.300000E-03 6.500000E-03 4.040000E-02 0.00000 0.00000 0.00000 0.00000 -7.060000E-02
4.840000E-02 -0.112500 0.110300 0.00000 0.00000 0.00000 1.10330 0.00000
0.00000 0.00000 3.940000E-02 -8.500000E-03 -1.120000E-02 6.600000E-03 5.700000E-03 1.430000E-02
-1.300000E-03 -2.470000E-02 0.00000 3.700000E-03 2.200000E-03 5.000000E-04 4.300000E-03 4.500000E-03
-2.250000E-02 0.00000 0.00000 0.00000 0.00000 0.00000 -2.000000E-03 -5.840000E-02
-0.157300 0.292400 0.00000 0.00000 0.00000 0.00000 0.00000 -3.600000E-03
-1.700000E-03 1.200000E-03 0.00000 0.00000 -3.400000E-03 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 -7.400000E-03 1.180000E-02
3 -2.200000E-03 1.040000E-02 3.500000E-02 0.00000 0.00000 0.00000 0.00000 -4.390000E-02
0.00000 -3.170000E-02 2.590000E-02 0.00000 0.00000 0.00000 0.259400 0.00000
0.00000 0.00000 3.920000E-02 -1.030000E-02 -3.500000E-03 1.500000E-03 3.600000E-03 9.000000E-03
-9.000000E-04 -1.680000E-02 0.00000 4.700000E-03 2.000000E-03 3.000000E-04 2.700000E-03 2.800000E-03
-1.560000E-02 0.00000 0.00000 0.00000 0.00000 0.00000 -3.200000E-03 -7.920000E-02
-0.131600 0.302200 0.00000 0.00000 0.00000 0.00000 0.00000 3.000000E-04
-2.000000E-03 1.300000E-03 0.00000 0.00000 -3.300000E-03 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 -1.180000E-02 1.880000E-02
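The letter codes (cmfa, cmfb, and so on) are the names of the 63 variables. Each letter code corresponds to the number in the same position within each of the numeric blocks that follow.
The first numeric block is for observation 0, the next block is for observation 1, and so on for more than 30,000 observations.
We would like to find a way to convert this into a delimited text file (preferably .csv). For the sample above, it would have 63 columns and one row per observation (plus one extra row and one extra column for the identifiers). Each column would be labeled with the corresponding letter code (cmfa, etc.).
If possible, we would like this to work on files containing any number of columns and any number of observations.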
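Answer 0 (score: 1)
Here is a way to parse the file you provided with plain Python (it works regardless of the number of lines in the file). A better implementation could be written with regular expressions, but I will leave that for you to try: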
# Import required libraries
import csv
import numpy as np

# Read the input file as text, one string per line
with open('input.txt') as f:
    lines = f.read().splitlines()

# Split each line on whitespace; drop the leading observation index
line2 = []
for l in lines:
    l2 = l.split()
    if len(l2) == 9:
        # First line of a data block: index followed by 8 values
        line2.append(l2[1:9])
    elif len(l2) == 7 or len(l2) == 8:
        line2.append(l2)

# Group every 8 lines into one record and convert the values to float
pl = np.arange(0, len(line2) + 1, 8)
line3 = []
for i in np.arange(0, len(pl) - 1):
    z = line2[pl[i]:pl[i + 1]]
    z2 = [item for sublist in z for item in sublist]
    if i == 0:
        line3.append(z2)  # header row: the variable names
    else:
        line3.append([float(v) for v in z2])

# Write the output file (the csv module expects newline='' in Python 3)
with open('output.csv', 'w', newline='') as f:
    wr = csv.writer(f)
    for row in line3:
        wr.writerow(row)
If you want to keep the index:
# Import required libraries
import csv
import numpy as np

# Read the input file as text, one string per line
with open('input.txt') as f:
    lines = f.read().splitlines()

# Split each line on whitespace and skip empty lines
line2 = []
for l in lines:
    l2 = l.split()
    if l2:
        line2.append(l2)

# Group every 8 lines into one record; keep the index as int, values as float
pl = np.arange(0, len(line2) + 1, 8)
line3 = []
for i in np.arange(0, len(pl) - 1):
    z = line2[pl[i]:pl[i + 1]]
    z2 = [item for sublist in z for item in sublist]
    if i == 0:
        line3.append([''] + z2)  # empty header cell above the index column
    else:
        line3.append([int(z2[0])] + [float(v) for v in z2[1:]])

# Write the output file (the csv module expects newline='' in Python 3)
with open('output.csv', 'w', newline='') as f:
    wr = csv.writer(f)
    for row in line3:
        wr.writerow(row)
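Both scripts above assume eight lines per block, which matches 63 variables. For the case raised in the question (any number of columns and observations), one possible approach is to treat the file as a flat stream of whitespace-separated tokens: the leading non-numeric tokens are the variable names, and every following group of "number of names + 1" tokens is one observation. This is only a sketch under assumptions: the helper names `is_number`, `parse_blocks`, and `convert` are made up for illustration, no variable name may look like a number (for example `nan` or `1e5`), and the whole file is read into memory.

```python
import csv


def is_number(token):
    """Return True if the token can be parsed as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_blocks(text):
    """Split the text into (names, rows) without assuming a fixed line layout.

    Assumption: the header consists of all leading non-numeric tokens, and
    each observation is an integer index followed by one value per name.
    """
    tokens = text.split()

    # Count the leading tokens that are not numbers: these are the names.
    n = 0
    while n < len(tokens) and not is_number(tokens[n]):
        n += 1
    names, values = tokens[:n], tokens[n:]

    width = n + 1  # observation index + one value per variable
    if n == 0 or len(values) % width != 0:
        raise ValueError("token count does not match the header width")

    rows = []
    for start in range(0, len(values), width):
        record = values[start:start + width]
        rows.append([int(record[0])] + [float(v) for v in record[1:]])
    return names, rows


def convert(src, dst, index_name="obs"):
    """Read the block-formatted file `src` and write it to `dst` as CSV."""
    with open(src) as f:
        names, rows = parse_blocks(f.read())
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([index_name] + names)
        writer.writerows(rows)
```

Calling `convert('input.txt', 'output.csv')` would then produce one header row plus one row per observation. Because the record width is derived from the header, the same code should handle files with a different number of variables, as long as the stated assumptions hold.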
Answer 1 (score: 0)
You can use mmap and a regular expression to parse the file without reading the whole file into memory.
Something like:
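import mmap
import re

# Placeholder file names; set these to your own paths
fn_in = "input.txt"
fn_out = "output.csv"

# A block is everything up to the next blank line (or the end of the file)
pattern = re.compile(rb"(.*?)(?:(?:^\s*$)|\Z)", re.M | re.S)

# mmap exposes the file as bytes, so the pattern must be a bytes pattern
with open(fn_in, "rb") as fin, open(fn_out, "w") as fout:
    data = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)  # 0 maps the whole file
    first = True
    for m in pattern.finditer(data):
        block = m.group(0).decode().strip()
        if not block:
            continue
        if first:
            # Header block: prepend a name for the observation-number column
            fout.write("O_N," + ",".join(block.split()) + "\n")
            first = False
        else:
            fout.write(",".join(block.split()) + "\n")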
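The same idea of not holding the whole file in memory can also be sketched without mmap, by iterating over the file line by line and grouping consecutive non-blank lines. Like the regular expression above, this assumes that blocks are separated by blank lines; the function names `iter_blocks` and `blocks_to_csv` are hypothetical, and this is an alternative technique rather than a drop-in replacement.

```python
from itertools import groupby


def iter_blocks(lines):
    """Yield each run of consecutive non-blank lines as a flat list of tokens."""
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [token for line in group for token in line.split()]


def blocks_to_csv(lines, out, index_name="O_N"):
    """Write one comma-separated row per block to the file-like object `out`.

    Assumption: the first block holds the variable names, so a name for the
    observation-number column is prepended to it.
    """
    for i, tokens in enumerate(iter_blocks(lines)):
        if i == 0:
            tokens = [index_name] + tokens
        out.write(",".join(tokens) + "\n")
```

A file object can be passed directly as `lines`, for example `with open('input.txt') as fin, open('output.csv', 'w') as fout: blocks_to_csv(fin, fout)`, so only one block is held in memory at a time.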