Splitting data in a text file using Python and Pandas

Time: 2015-02-11 16:54:33

Tags: python python-3.x numpy matplotlib pandas

I have the following data from a CFD simulation:

  Average value for X = 0.5080000265E-0003 to 0.2489200234E-0001          
  Z = -.3141592741E+0001     
  Time = 0.7000032425E+0001     
       Y             P_g     
  0.1511904760E-0002  0.2565604063E+0006
  0.4535714164E-0002  0.2565349844E+0006
  0.7559523918E-0002  0.2565098906E+0006
  0.1058333274E-0001  0.2564848125E+0006
  0.1360714249E-0001  0.2564597656E+0006
  0.1663095318E-0001  0.2564346563E+0006
  0.1965476200E-0001  0.2564095625E+0006
         ...                 ...
         ...                 ...
  0.1259419441E+0001  0.2549983125E+0006
  0.1262443304E+0001  0.2549983125E+0006
  0.1265467167E+0001  0.2549983125E+0006
  0.1268491030E+0001  0.2549982656E+0006
  Time = 0.7010014057E+0001     
       Y             P_g     
  0.1511904760E-0002  0.2565604063E+0006
  0.4535714164E-0002  0.2565349844E+0006
  0.7559523918E-0002  0.2565098906E+0006
  0.1058333274E-0001  0.2564848125E+0006
         ...                 ...
         ...                 ...
  0.1259419441E+0001  0.2549983125E+0006
  0.1262443304E+0001  0.2549983125E+0006
  0.1265467167E+0001  0.2549983125E+0006
  0.1268491030E+0001  0.2549982656E+0006
  Time = 0.7020006657E+0001     
       Y             P_g     
  0.1511904760E-0002  0.2565604063E+0006
  0.1058333274E-0001  0.2564848125E+0006
         ...                 ...

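As the sample above shows, the data is split into vertical sections by time-step headers labeled Time. Within each section Y does not change, but P_g does. To plot the data, I need the P_g column of each section placed side by side, one column per time step. For example, I need to rearrange the data like this: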

      Y                0.7000032425E+1     0.7020006657E+1       ...
  0.1511904760E-0002  0.2565604063E+0006  0.2549982656E+0006  ...  
  0.4535714164E-0002  0.2565349844E+0006  0.2549982656E+0006  ...
  0.7559523918E-0002  0.2565098906E+0006  0.2549982656E+0006  ...
  0.1058333274E-0001  0.2564848125E+0006  0.2549982656E+0006  ...
  0.1360714249E-0001  0.2564597656E+0006  0.2549982656E+0006  ...

Using Pandas, I can read the data from the text file and create a new data frame with the Y values as the index (rows) and the Time values as the columns:

import pandas as pd

# Read in data from text file
# -------------------------------------------------------------------------

# data frame from text file contents, skip first 4 rows, separate by variable
# white space, no header
df = pd.read_table('ROP_s_SD.dat', skiprows=4, sep=r'\s+', header=None)

# Time data
# -------------------------------------------------------------------------

# data frame of the rows that contain the Time string
dftime = df.loc[df.iloc[:, 0].str.contains('Time')]

t = dftime[2].tolist()  # time list
idx = dftime.index      # index of rows containing Time string

# Y data
# -------------------------------------------------------------------------

# grab values for y to create index for new data frame
ido = idx[0]+2      # index of first y value
idf = idx[1]        # index of last y value
y = []              # empty list to store y values

for i in range(ido, idf):   # iterate through first section of y values
    v = df.iloc[i, 0]       # get y value from data frame
    y.append(float(v))      # add y value to y list

# New data frame
# ------------------------------------------------------------------------

# empty data frame with y as index and t as columns
dfnew = pd.DataFrame(None, index=y, columns=t)
print('dfnew is \n', dfnew.head())

The head of the empty data frame, dfnew.head(), looks like this:

          7.000032 7.010014 7.020007 7.030043 7.040020 7.050035 7.060043  
0.001512      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
0.004536      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
0.007560      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
0.010583      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
0.013607      NaN      NaN      NaN      NaN      NaN      NaN      NaN   

         7.070004 7.080036 7.090022   ...    7.650011 7.660032 7.670026
0.001512      NaN      NaN      NaN   ...         NaN      NaN      NaN   
0.004536      NaN      NaN      NaN   ...         NaN      NaN      NaN   
0.007560      NaN      NaN      NaN   ...         NaN      NaN      NaN   
0.010583      NaN      NaN      NaN   ...         NaN      NaN      NaN   
0.013607      NaN      NaN      NaN   ...         NaN      NaN      NaN   

         7.680044 7.690029 7.700008 7.710012 7.720014 7.730019 7.740026  
0.001512      NaN      NaN      NaN      NaN      NaN      NaN      NaN  
0.004536      NaN      NaN      NaN      NaN      NaN      NaN      NaN  
0.007560      NaN      NaN      NaN      NaN      NaN      NaN      NaN  
0.010583      NaN      NaN      NaN      NaN      NaN      NaN      NaN  
0.013607      NaN      NaN      NaN      NaN      NaN      NaN      NaN  

[5 rows x 75 columns]

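The NaN entries in each column should hold the P_g values for that particular Time section. How can I add the P_g values of each section to their respective columns?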

The text file I am reading can be downloaded here

2 answers:

Answer 0: (score: 1)

It looks like you have already done most of the hard work... The following lines will finish unpacking your DataFrame:

# Add one more element to idx for correct indexing on the last column
idx = list(idx)
idx.append(len(df))

# Loop over the idx locations to fill the columns
for i in range(len(dfnew.columns)):
    dfnew.iloc[:, i] = df.iloc[idx[i]+2:idx[i+1], 1].values

The head of the first 3 columns of dfnew now looks like this:

                    7.000032            7.010014            7.020007
0.001512  0.2565604063E+0006  0.2565604063E+0006  0.2565604063E+0006   
0.004536  0.2565349844E+0006  0.2565349844E+0006  0.2565349844E+0006   
0.007560  0.2565098906E+0006  0.2565098906E+0006  0.2565098906E+0006   
0.010583  0.2564848125E+0006  0.2564848125E+0006  0.2564848125E+0006   
0.013607  0.2564597656E+0006  0.2564597656E+0006  0.2564597656E+0006  
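
For anyone who wants to try the slicing idea without the original file, here is a minimal sketch on a small in-memory sample. The sample string is a hypothetical miniature built from the first rows shown in the question, and the names rows, times, columns and wide are mine, not part of the code above. It assumes every section has the same number of Y rows.

```python
import pandas as pd

# Hypothetical miniature of the file: two time sections, three Y values each
sample = """\
  Time = 0.7000032425E+0001
       Y             P_g
  0.1511904760E-0002  0.2565604063E+0006
  0.4535714164E-0002  0.2565349844E+0006
  0.7559523918E-0002  0.2565098906E+0006
  Time = 0.7010014057E+0001
       Y             P_g
  0.1511904760E-0002  0.2565604063E+0006
  0.4535714164E-0002  0.2565349844E+0006
  0.7559523918E-0002  0.2565098906E+0006
"""

rows = [line.split() for line in sample.splitlines()]

# positions of the "Time = ..." rows, plus one sentinel marking the end of the data
idx = [i for i, r in enumerate(rows) if r[0] == 'Time']
idx.append(len(rows))

times = [float(rows[i][2]) for i in idx[:-1]]
y = [float(r[0]) for r in rows[idx[0] + 2:idx[1]]]

# the P_g values of section k sit between its header + 2 and the next header
columns = {t: [float(r[1]) for r in rows[idx[k] + 2:idx[k + 1]]]
           for k, t in enumerate(times)}
wide = pd.DataFrame(columns, index=y)
print(wide)
```

The extra sentinel appended to idx plays the same role as idx.append(len(df)) above: it gives the last section an end boundary.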

You have a lot of elements, so the best way to view the data is probably in 2D:

data = dfnew.astype(float).values
extent = [float(dfnew.columns[0]),
          float(dfnew.columns[-1]),
          float(dfnew.index[0]),
          float(dfnew.index[-1])]
import matplotlib.pyplot as plt
plt.imshow(data, extent=extent, origin='lower')
plt.xlabel('Time')
plt.ylabel('Y')

By the way, it looks like the P_g values in your sample file are the same at every time step...
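
One way to check that observation, sketched on a hypothetical two-column stand-in for dfnew (the real frame has 75 columns and the values are still strings at this point, so they are converted first):

```python
import pandas as pd

# Hypothetical stand-in for dfnew: Y as index, one column per time step,
# values still stored as strings as in the output above
dfnew = pd.DataFrame(
    {7.000032: ['0.2565604063E+0006', '0.2565349844E+0006'],
     7.010014: ['0.2565604063E+0006', '0.2565349844E+0006']},
    index=[0.001512, 0.004536])

values = dfnew.astype(float)

# True for a row when every time step holds the same P_g at that Y
same_per_row = values.nunique(axis=1) == 1
print(same_per_row.all())
```

If this prints True for the full frame, every time step carries an identical P_g profile and the image plot will show no variation along the Time axis.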

Answer 1: (score: 0)

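Two things. First, perhaps you could think about how to reduce this to a 2D spreadsheet. Which columns should each row contain? I suggest each row should contain Time, Y and P_g. Maybe that will suggest a strategy for handling the funky input format.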

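Second, are you trying to plot P_g against Y and Time? Your data appears to have 3 variables, and you need to reduce it to 2 dimensions to produce a 2D plot. Do you want to plot the average value of P_g for a particular Time? Or do you want a 3D plot in which you plot P_g against Y for each Time? Assuming you adopt the row/column structure I mentioned above, any of these is easy to do with pandas. Take a look at the pandas groupby functionality. Here's more detail on that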
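
As a rough illustration of the groupby idea, assuming the data is already in the long Time / Y / P_g layout suggested above. The frame long_df is a hypothetical four-row example using values from the question, not the real file:

```python
import pandas as pd

# Hypothetical long-format table: one row per (Time, Y) pair
long_df = pd.DataFrame({
    'Time': [7.000032425, 7.000032425, 7.010014057, 7.010014057],
    'Y':    [0.001511904760, 0.004535714164, 0.001511904760, 0.004535714164],
    'P_g':  [256560.4063, 256534.9844, 256560.4063, 256534.9844],
})

# average P_g at each time step: one value per Time, which is enough for a 2D line plot
mean_pg = long_df.groupby('Time')['P_g'].mean()
print(mean_pg)
```

The result is a Series indexed by Time, so it can be plotted directly against time.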

Edit: you have clarified both of my questions. Try this:

import pandas
from io import StringIO

with open('ROP_s_SD.dat', 'r') as f:
    text = f.read()

# split on the section headers and ignore the first chunk (the file header)
chunks = text.split("Time = ")[1:]

frames = []
for chunk in chunks:
    # the first line of a chunk is the time value, the rest is a Y / P_g table
    time_str, rest_str = chunk.split('\n', 1)
    chunk_df = pandas.read_csv(StringIO(rest_str), sep=r'\s+')
    chunk_df['Time'] = float(time_str)
    frames.append(chunk_df)

# main dataframe; you should now have columns 'Time', 'Y', 'P_g'
df = pandas.concat(frames, ignore_index=True)
assert sorted(df.columns) == ['P_g', 'Time', 'Y']

# iterate over unique values of time
times = sorted(set(df['Time']))
assert len(times) == len(chunks)
for i, time in enumerate(times):
    chunk_data = df[df['Time'] == time]
    # plot or do whatever you'd like with each segment
    means = chunk_data.mean()
    stds = chunk_data.std(ddof=0)
    print('Data for time %d (%0.4f): ' % (i, time))
    print(means, stds)
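
If what you need in the end is the wide layout from the question (Y as rows, one column per Time), a long table like this can be reshaped with pivot. This is a small sketch on a hypothetical four-row frame named long_df, not on the real file, and it assumes each (Y, Time) pair occurs only once:

```python
import pandas as pd

# Hypothetical long-format table with the same columns as the main dataframe
long_df = pd.DataFrame({
    'Time': [7.000032425, 7.000032425, 7.010014057, 7.010014057],
    'Y':    [0.001511904760, 0.004535714164, 0.001511904760, 0.004535714164],
    'P_g':  [256560.4063, 256534.9844, 256560.4063, 256534.9844],
})

# rows become Y values, columns become time steps, cells hold P_g
wide = long_df.pivot(index='Y', columns='Time', values='P_g')
print(wide)
```

pivot raises an error on duplicate (Y, Time) pairs, which is a useful sanity check that no section was read twice.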