Python: splitting a large text file with many headers

Date: 2014-10-08 20:10:53

Tags: python file text split header

I have a large text file that looks like this:

lat lon altitude pressure
3 lines data group bsas
2.3 4.5 45.0 875
5.6 6.5 46.2 676
3.4 3.4 48.2 565
6 lines data group sdad
3.4 4.5 56.1 535
5.6 6.5 46.2 676    
3.4 4.5 56.1 535
2.3 4.5 45.0 875
5.6 6.5 46.2 676
3.4 3.4 48.2 565
50 lines data group asdasd
5.5 6.6 44.5 343
...
3.7 8.4 56.5 456
... and so on

I want to split the whole file into separate data groups, each stored in a 2-D array. So far I have tried two approaches.

The first is to iterate over the lines and collect the data like this:

# define an object class called Wave here
# each object has 4 attributes: lat, lon, altitude, pressure
wave_list = []
with open(filename, 'r') as f:
    next(f)  # skip the header
    wave = Wave()
    for line in f:
        if 'data' in line:
            if wave.lat:  # only store groups that received data
                wave_list.append(wave)
            wave = Wave()
        else:
            fields = line.split()  # split once instead of four times
            wave.lat.append(fields[0])
            wave.lon.append(fields[1])
            wave.altitude.append(fields[2])
            wave.pressure.append(fields[3])
    if wave.lat:  # append the final group
        wave_list.append(wave)
return wave_list

The second is to use numpy's loadtxt:

import numpy as np
from io import StringIO

with open(filename, 'r') as f:
    txt = f.read()
# split by "data", remove the first element
raw_chunks = txt.split("data")[1:]
# define a new list to store results
wave_list = []
# go through each chunk
for rc in raw_chunks:
    # find the first index of "\n"
    first_id = rc.find("\n")
    # find the last index of "\n"
    last_id = rc.rfind("\n")
    # temporary chunk holding only the numeric lines
    temp_chunk = rc[first_id:last_id]
    # load data using loadtxt
    data = np.loadtxt(StringIO(temp_chunk))
    wave = Wave()
    wave.lat = data.T[0]
    wave.lon = data.T[1]
    wave.altitude = data.T[2]
    wave.pressure = data.T[3]
    wave_list.append(wave)
return wave_list

However, both approaches are slow. I looked at the pandas documentation but could not find a way to skip the headers in the middle of the file. I also looked at several related questions:

Splitting a file based on text in Python

Split the text file in python

How to split and parse a big text file in python in a memory-efficient way?

but none of them solves my problem. Is there a faster way to read this kind of text file? Thanks in advance.
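(For what it's worth, pandas can cope with the mid-file headers if every line is read with an extra column: the header lines have five whitespace-separated fields and the data lines four, so the fifth column is NaN exactly on the data rows. A sketch, not from the question; the column names and the `tag` trick are assumptions:)

```python
import io
import pandas as pd

text = """lat lon altitude pressure
3 lines data group bsas
2.3 4.5 45.0 875
5.6 6.5 46.2 676
3.4 3.4 48.2 565
2 lines data group sdad
3.4 4.5 56.1 535
5.6 6.5 46.2 676
"""

# five names: header rows fill every column, data rows leave 'tag' as NaN
df = pd.read_csv(io.StringIO(text), sep=r'\s+', skiprows=1,
                 names=['lat', 'lon', 'altitude', 'pressure', 'tag'])
is_header = df['tag'].notna()
# each header line increments the running group id; split the data rows on it
groups = [sub.drop(columns='tag').astype(float).to_numpy()
          for _, sub in df[~is_header].groupby(is_header.cumsum()[~is_header])]
# groups is a list of 2-D float arrays, one per data group
```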

1 Answer:

Answer 0 (score: 1)

Search for lines of the form <number> lines data group <something>, storing the group name (<something>) and the number of lines to read (<number>), then consume the n lines that follow each match, e.g.:

Given the following code:

from itertools import islice
from collections import defaultdict
import re

data = defaultdict(list) 
with open(filename) as fin:
    header = next(fin, '').split()
    for line in fin:
        m = re.match(r'(\d+) lines.*(\b\w+)$', line)
        if m:
            data[m.group(2)].extend(islice(fin, int(m.group(1))))

and the input:

lat lon altitude pressure
3 lines data group bsas
2.3 4.5 45.0 875
5.6 6.5 46.2 676
3.4 3.4 48.2 565
6 lines data group sdad
3.4 4.5 56.1 535
5.6 6.5 46.2 676    
3.4 4.5 56.1 535
2.3 4.5 45.0 875
5.6 6.5 46.2 676
3.4 3.4 48.2 565

data will be:

{'bsas': ['2.3 4.5 45.0 875\n', '5.6 6.5 46.2 676\n', '3.4 3.4 48.2 565\n'],
 'sdad': ['3.4 4.5 56.1 535\n',
          '5.6 6.5 46.2 676    \n',
          '3.4 4.5 56.1 535\n',
          '2.3 4.5 45.0 875\n',
          '5.6 6.5 46.2 676\n',
          '3.4 3.4 48.2 565\n']}
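To get the 2-D arrays the question asks for, each group's raw lines can then be joined and fed to np.loadtxt (a sketch; assumes data was collected as shown above):

```python
import io
import numpy as np

# sample of the collected structure: group name -> list of raw lines
data = {'bsas': ['2.3 4.5 45.0 875\n',
                 '5.6 6.5 46.2 676\n',
                 '3.4 3.4 48.2 565\n']}

# join each group's lines and parse them as one 2-D float array
arrays = {name: np.loadtxt(io.StringIO(''.join(lines)))
          for name, lines in data.items()}
# arrays['bsas'] is a (3, 4) ndarray: columns are lat, lon, altitude, pressure
```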

Following up on your comment: if the group names don't matter, then:

data = []
with open(filename) as fin:
    header = next(fin, '').split()
    for line in fin:
        m = re.match(r'(\d+) lines.*(\b\w+)$', line)
        if m:
            data.append(list(islice(fin, int(m.group(1)))))
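The collected chunks can then be turned into objects holding per-column arrays, mirroring the question's Wave attributes (a sketch; this Wave is a minimal stand-in for the asker's class):

```python
import io
import numpy as np

class Wave:
    # minimal stand-in for the question's Wave class
    def __init__(self, lat, lon, altitude, pressure):
        self.lat, self.lon = lat, lon
        self.altitude, self.pressure = altitude, pressure

def chunks_to_waves(chunks):
    waves = []
    for chunk in chunks:
        # ndmin=2 keeps single-line groups two-dimensional
        cols = np.loadtxt(io.StringIO(''.join(chunk)), ndmin=2).T
        waves.append(Wave(*cols))  # unpack the four columns
    return waves

waves = chunks_to_waves([['2.3 4.5 45.0 875\n', '5.6 6.5 46.2 676\n']])
# waves[0].lat is array([2.3, 5.6]), etc.
```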