在Python中读取非结构化文本文件以使其结构化

时间:2020-05-05 12:50:56

标签: python python-3.x

File我有这个附加的文本文件,其中包含带有一些信息行的非结构化数据。我如何结构化这些数据(以结构化方式提取信息)。因此,最后我有几列(在本例中为5),并具有相应的信息。框架50包含10个值,框架51包含10个值,依此类推,并且分别获得前4行的值。 我尝试了以下代码。但这不是我得到的最好的列表/数组。 谢谢

frame =[]
frame1 =[]

flag = -1
counter = -1
counter_val = 0
f = open(filepath, "r")
for line in f:
    element = line.split(' ')
    if(len(element) == 4):
        if(element[1] == "Frame_Number") :
            # print(element[1])
            if(flag == 0):
                # print(len(frame1))
                frame.append(frame1)
            flag = 0
            counter = counter + 1
            counter_val = 0
            frame1 =[]
        continue
    if(flag == 0):   
        frame1.append(line)
        counter_val = counter_val + 1

print(frame[1])

2 个答案:

答案 0 :(得分:2)

这是一个熊猫解决方案,

import pandas as pd

# Read in the data as a Pandas Series
df = pd.read_csv('testsd.txt', sep = '\n', header = None, squeeze = True) 

# Get the names of the eventual column names ('# Frame_Number 50', ...)
colNames = df.loc[df.str.startswith('# Frame_Number')]

# Store the first few lines of metadata in another frame and drop them from the original dataframe
meta_df = df[: colNames.index.to_list()[0]]]
df.drop(range(colNames.index.to_list()[0]), inplace = True)

# Drop the eventual column names
df.drop(colNames.index.to_list(), inplace = True)

原始数据帧中剩下的应该只是数据。现在重塑数据框。请注意,这仅在每一列具有相同数目的条目时才有效。

df = pd.DataFrame(df.values.reshape(len(colNames), int(len(df) / len(colNames))).T, columns = colNames)

reshape函数将所需的行数和列数作为参数。它会水平重塑,因此我们将转置结果。最后,如果需要,可以添加我们保存为数据框的一列的元数据,尽管您实际上应该将其另存为文件。

df['meta'] = meta_df

将数据框写入文件:

df.to_csv('testsd.csv')

输出:

enter image description here

答案 1 :(得分:1)

尝试以下

代码

import csv

def convert_csv(filenm):
  " Produces structured data by converting to CSV file "

  with open(filenm, 'r') as fin,  open('out.txt', 'w') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)

    frames = []
    frame_vals = []
    for line in fin:
      line = line.rstrip()
      if line:
        if line[0] == "#":
          field, value = line[1:].split('=')
          field, value = field.strip(), value.strip()
          if field == 'Frame_Number':
            frames.append(value)    # current frame number
            frame_vals.append([])   # new sublist for frame values
        else:
          frame_vals[-1].append(line.strip())  # append to current frame values

    # Write header
    fnames = ['Frame_' + str(v) for v in frames]
    csv_writer.writerow(fnames)

    # write other data
    for row in zip(*frame_vals):  # transposing to get each frame in a column
      csv_writer.writerow(row)

convert_csv('testd.txt')

测试

输入:testd.txt

# Samples_per_Frame = 8192
# Chirp_Time_sec = 0.000133
# Pulse_Repetition_Time_sec = 0.00050355
# Frame_Period_sec = 0.2

# Frame_Number = 50
0.50061053
0.49938953
0.49426132
0.48962152
0.48791212
0.48937732
0.49523813
0.49914533
0.50158733
0.49914533
# Frame_Number = 51
0.50061053
0.49938953
0.49426132
0.48962152
0.48791212
0.48937732
0.49523813
0.49914533
0.50158733
0.49914533
# Frame_Number = 52
0.50793654
0.50647134
0.49841273
0.48937732
0.48644692
0.49035412
0.49768013
0.50647134
0.51282054
0.50940174
# Frame_Number = 53
0.49670333
0.49181932
0.4840049
0.48547012
0.48791212
0.49230772
0.49768013
0.49816853
0.49181932
0.48595852
# Frame_Number = 54
0.49352872
0.49597073
0.49987793
0.50354093
0.50402933
0.50036633
0.49841273
0.49743593
0.49865693
0.50012213

输出:out.txt

Frame_50 Frame_51 Frame_52 Frame_53 Frame_54
0.50061053 0.50061053 0.50793654 0.49670333 0.49352872
0.49938953 0.49938953 0.50647134 0.49181932 0.49597073
0.49426132 0.49426132 0.49841273 0.4840049 0.49987793
0.48962152 0.48962152 0.48937732 0.48547012 0.50354093
0.48791212 0.48791212 0.48644692 0.48791212 0.50402933
0.48937732 0.48937732 0.49035412 0.49230772 0.50036633
0.49523813 0.49523813 0.49768013 0.49768013 0.49841273
0.49914533 0.49914533 0.50647134 0.49816853 0.49743593
0.50158733 0.50158733 0.51282054 0.49181932 0.49865693
0.49914533 0.49914533 0.50940174 0.48595852 0.50012213