How do I read a complex .dat file with Python?

Time: 2016-11-22 09:53:39

Tags: python data-structures dataframe

I have a large .dat file with the following format:

  "Trajectory"  0

"Type : "
    Transmitted

"Collisions"

"X" "Y" "Z" "Energy"    
-17.418 11.0038 -2633.51    300 
-7.80195    4.90819 -1317.76    300 
-2.98663    1.85574 -658.878    300 
-0.578976   0.329517    -329.439    300 
-0.278019   0.138739    -288.259    300 
-0.12754    0.0433497   -267.669    300
''
''
''
''
56.1784 -56.9043    2103.34 297.645224483   
58.9321 -57.4033    2155.91 297.617470093   
78.4242 -59.0752    2635.51 297.364385221   
78.8647 -59.113 2646.35 297.358666592   
"-----------------------------------------------------------------"
"Trajectory"    1

"Type : "
    Transmitted

"Collisions"

"X" "Y" "Z" "Energy"    
19.5684 -1.57545    -2633.51    300 
8.78275 -0.663686   -1317.76    300 
3.38175 -0.207111   -658.878    300 
0.931759    0   -360    300 
0.681244    0.0211774   -329.439    300 
0.343681    0.0497133   -288.259    300 

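It then continues like this for about a hundred more "trajectories". My goal is to plot all of the trajectories, so I would like to know how to pull the X, Y, Z and Energy data for each trajectory out of this .dat file.

Thanks!

2 answers:

Answer 0: (score: 0)

I think you can use the following, provided the data contain no NaN values (using a sample csv file):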

import pandas as pd

# read the tab-separated file, naming the four columns
df = pd.read_csv('sample.csv', sep='\t', names=['X','Y','Z','Energy'])
#print (df)

# remove the repeated header rows, where column X holds the literal 'X'
df = df[df.X != 'X']
# add a new column 'groups': where column X contains 'Trajectory', copy the value of column X
df['groups'] = df.loc[df.X.str.contains('Trajectory', na=False), 'X']
# forward fill the NaN values of column 'groups' (assign back instead of using inplace=True)
df['groups'] = df['groups'].ffill()
# remove all rows that contain NaN values
df = df.dropna().reset_index(drop=True)
# convert the four data columns to float
df[['X','Y','Z','Energy']] = df[['X','Y','Z','Energy']].astype(float)
print (df)

            X          Y         Z      Energy        groups
0  -17.418000  11.003800 -2633.510  300.000000  Trajectory 0
1   -2.986630   1.855740  -658.878  300.000000  Trajectory 0
2   -0.578976   0.329517  -329.439  300.000000  Trajectory 0
3   -0.278019   0.138739  -288.259  300.000000  Trajectory 0
4   -0.127540   0.043350  -267.669  300.000000  Trajectory 0
5   56.178400 -56.904300  2103.340  297.645224  Trajectory 0
6   58.932100 -57.403300  2155.910  297.617470  Trajectory 0
7   78.424200 -59.075200  2635.510  297.364385  Trajectory 0
8   78.864700 -59.113000  2646.350  297.358667  Trajectory 0
9   19.568400  -1.575450 -2633.510  300.000000  Trajectory 1
10   8.782750  -0.663686 -1317.760  300.000000  Trajectory 1
11   3.381750  -0.207111  -658.878  300.000000  Trajectory 1
12   0.931759   0.000000  -360.000  300.000000  Trajectory 1
13   0.681244   0.021177  -329.439  300.000000  Trajectory 1
14   0.343681   0.049713  -288.259  300.000000  Trajectory 1
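
To get from that single table to one table per trajectory, a `groupby` on the label column is probably enough. The following is only a sketch: the small `df` is a hand-made stand-in holding a few rows from the output above, and the `trajectory_id` column and the `trajectories` dict are names introduced here for illustration.

```python
import pandas as pd

# Small stand-in for the tidy DataFrame printed above (values copied from it).
df = pd.DataFrame({
    'X': [-17.418, -2.98663, 19.5684],
    'Y': [11.0038, 1.85574, -1.57545],
    'Z': [-2633.51, -658.878, -2633.51],
    'Energy': [300.0, 300.0, 300.0],
    'groups': ['Trajectory 0', 'Trajectory 0', 'Trajectory 1'],
})

# Turn 'Trajectory 12' into the integer 12 so that groups sort numerically,
# not alphabetically (hypothetical helper column).
df['trajectory_id'] = df['groups'].str.split().str[-1].astype(int)

# One DataFrame per trajectory, keyed by its integer id.
trajectories = {
    tid: part[['X', 'Y', 'Z', 'Energy']].reset_index(drop=True)
    for tid, part in df.groupby('trajectory_id')
}

print(trajectories[1])
```

Each value in `trajectories` then holds only the X, Y, Z and Energy columns of one trajectory, which is a convenient shape to hand to a plotting call.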

Answer 1: (score: 0)

This function takes a file name and parses the file into a numpy structured array:

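def extract_trajectories(fn):
    import numpy
    d = []
    with open(fn, 'r') as f:
        trajectory = 0
        data = False  # True while we are inside a block of numeric rows
        for l in f:
            if '"Trajectory"' in l:
                # header line such as: "Trajectory"  0
                trajectory = int(l.split()[1])
            if '"-----------------------------------------------------------------"' in l:
                # the separator line ends the current data block
                data = False
            if data and l.strip() and "''" not in l:
                # skip blank lines and '' placeholder lines; keep the numeric rows
                d.append(tuple([trajectory] + [float(x) for x in l.split()]))
            if '"X" "Y" "Z"' in l:
                # column header: the numeric rows start on the next line
                data = True
    return numpy.array(d, dtype=[('Trajectory', 'i4'), ('X', 'f4'), ('Y', 'f4'), ('Z', 'f4'), ('Energy', 'f4')])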

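In general, there is no way around writing your own parsing code for a non-standard file layout.

For example, to get all the 'X' values of trajectory 1, you simply index the array: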

In [6]: d['X'][d['Trajectory']==1]
Out[6]: 
array([ 19.56839943,   8.78275013,   3.38175011,   0.931759  ,
         0.68124402,   0.34368101], dtype=float32)
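
Since the stated goal is to plot every trajectory, the same indexing can be done once per trajectory id. This is a rough sketch, not a definitive implementation: `split_by_trajectory` and `plot_trajectories` are hypothetical helper names, the small array `d` is a hand-made stand-in with the same dtype as the function above returns, and the plotting part assumes matplotlib is installed.

```python
import numpy as np

def split_by_trajectory(d):
    """Return {trajectory id: rows of d that belong to that trajectory}."""
    return {int(t): d[d['Trajectory'] == t] for t in np.unique(d['Trajectory'])}

def plot_trajectories(groups):
    """Draw one 3D line per trajectory (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    for t, rows in groups.items():
        ax.plot(rows['X'], rows['Y'], rows['Z'], label=f'Trajectory {t}')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    plt.show()

# Hand-made stand-in using the same dtype as extract_trajectories returns.
dtype = [('Trajectory', 'i4'), ('X', 'f4'), ('Y', 'f4'), ('Z', 'f4'), ('Energy', 'f4')]
d = np.array([
    (0, -17.418, 11.0038, -2633.51, 300.0),
    (0, -7.80195, 4.90819, -1317.76, 300.0),
    (1, 19.5684, -1.57545, -2633.51, 300.0),
], dtype=dtype)

groups = split_by_trajectory(d)
# plot_trajectories(groups)  # uncomment to open a plot window
```

With the real file, `d` would come from `extract_trajectories` instead of the hand-made array. The 'Energy' field of each group is available in the same way, for example to colour the points.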