Question

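I'm new to Python and relatively new to programming. My basic problem is parsing a large data file (millions of lines) stored in a .dat file.

Sample data from the file: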

820401001 825029767710821718 8 5 510-180090000 8   9   4
820401001 8083 7970200661367 7 8 0 3-170090070 0  24   1
820401001 8082 4745200341-18 4 9 0 3 240080044 0 -20   2
820401001 8062 5805200461367 2 9 0 3 120066725 0  -7   2
820401001 8037 5292200491-17 7 7 0 3-170090070 0 -16   2

I know the following:

Every line is always 56 characters long. All characters are digits or the "-" sign. Basically, the data is numeric.
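Each line has 20 columns, with column widths (i.e. number of characters) of 8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4 (these add up to 56).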

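Besides parsing each line, I also need to perform arithmetic on columns 3 and 4. Specifically, I want to take the 5 characters and divide by 100, e.g. instead of 29767 I want 297.67.

The goal is to create a huge matrix containing the resulting values. Ideally, I'd like to save the matrix to a new file, but I'm not sure how to do that.

The desired output looks something like: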

82040100 1 82.50 297.67 71082 1 71 8 8 5 5 10 -1 800 900 0 0 8 9 4
82040100 1 80.83  79.70 20066 1 36 7 7 8 0  3 -1 700 900 7 0 0 24 1

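I know I can use the struct library (see my code attempt below), but I get the error "unpack_from requires a buffer of at least 224 bytes". I have no idea what that means.

Also, I don't know how to perform the operation on columns 3 and 4 efficiently, i.e. can I do it while parsing, or should I add a separate "if-else" statement?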

import struct

fieldwidths = (8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4) 
fmtstring = ' '.join('{}{}'.format(abs(fw), 'i') for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)

parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

print("Opening the file.")
data_file = open("APR82L.dat", "r")

print("\nReading one line at a time")
#set to 10 just to test
for i in range(10):
    line = data_file.readline()
    print(line)
    fields = parse(line)
    print('fields: {}'.format(fields))

#Close the data file
print("\nClosing the data file")
data_file.close

Answer 1

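Your format string describes 56 int fields ('8i' means eight ints, '1i' means one int, and so on), each assumed to be 4 bytes long, so the buffer being unpacked must be at least 4*56 = 224 bytes long. However, you are passing a string of length 56 (-ish, depending on line endings).

You could probably massage your data into a format suitable for passing to struct.unpack_from, but the real problem is that struct is intended for packing/unpacking binary data, not text strings. You may well end up spending more time preparing the input than actually parsing it. Chances are you'll find it easier to avoid struct entirely and write a simple line parser yourself, like this: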
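
The 224-byte figure can be reproduced with struct.calcsize. The sketch below also shows, purely as an illustration and not as a recommendation, that a byte-string format ('s' instead of 'i') has the expected size of 56 and can split a record. The names WIDTHS, INT_FMT and BYTES_FMT are made up for this example, and the 4-byte int size is an assumption that holds on typical platforms.

```python
import struct

WIDTHS = (8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4)

# '8i' means "eight ints", not "one 8-character integer".
INT_FMT = ' '.join('{}i'.format(w) for w in WIDTHS)
# '8s' means "one byte string of length 8", which matches a text column.
BYTES_FMT = ' '.join('{}s'.format(w) for w in WIDTHS)

int_size = struct.calcsize(INT_FMT)      # 56 ints, 4 bytes each (assumed int size)
bytes_size = struct.calcsize(BYTES_FMT)  # 56 single bytes

# First sample line from the question, as bytes (struct works on bytes, not str).
line = b'820401001 825029767710821718 8 5 510-180090000 8   9   4'
raw_fields = struct.unpack_from(BYTES_FMT, line)

# Each piece is still text, so it has to be converted to a number separately.
fields = [int(f.decode('ascii')) for f in raw_fields]
print(int_size, bytes_size, fields[:5])
```

Even in this form struct only does the splitting; the conversion to numbers is still a separate step.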

col_widths = [8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4]

def parse(line): # this is neither blazing fast, nor clever, but it does work.
    fields = []
    idx = 0
    for width in col_widths:
        next_idx = idx + width
        fields.append(int(line[idx:next_idx]))
        idx = next_idx
    return fields

Additionally, you may want a simple check to make sure each line is even worth parsing:

with open('APR82L.dat') as data_file:
    for line in data_file: # This is the normal way to read a file line by line
        if line.strip(): # if the line isn't empty:
            fields = parse(line)
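
As a quick check of the slicing approach, the sketch below runs the parser on the first sample line from the question. The is_record helper is hypothetical: it is a stricter alternative to line.strip() and assumes a valid line has exactly 56 characters made up only of digits, spaces and minus signs.

```python
col_widths = [8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4]
RECORD_LEN = sum(col_widths)      # 56 characters per record
ALLOWED = set('0123456789- ')     # assumption: digits, minus signs and padding spaces

def is_record(line):
    """Return True if the line looks like a complete fixed-width record."""
    body = line.rstrip('\r\n')
    return len(body) == RECORD_LEN and set(body) <= ALLOWED

def parse(line):
    """Slice the line into columns and convert each column to int."""
    fields = []
    idx = 0
    for width in col_widths:
        next_idx = idx + width
        fields.append(int(line[idx:next_idx]))
        idx = next_idx
    return fields

sample = '820401001 825029767710821718 8 5 510-180090000 8   9   4\n'
if is_record(sample):
    print(parse(sample))
# [82040100, 1, 8250, 29767, 71082, 1, 71, 8, 8, 5, 5, 10, -1, 800, 900, 0, 0, 8, 9, 4]
```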

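As for the arithmetic, do it wherever it makes sense. If it's a relatively simple operation, I'd suggest writing a function that does whatever needs to be done and calling it as the data is read.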

def calculate(fields):
    x = fields[2] # third field
    y = fields[3] # fourth field
    return x + y # or whatever

with open('APR82L.dat') as data_file:
    for line in data_file:
        if line.strip():
            fields = parse(line)  # parse line into fields as above
            result = calculate(fields)
            # then write the result someplace or whatever's appropriate
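
For the specific operation in the question (dividing columns 3 and 4 by 100 and saving the rows), one possible sketch follows. The helper names parse_scaled, format_row and convert are invented for this example, and the two-decimal output format is an assumption based on the desired output shown in the question. In-memory StringIO objects stand in for the real files.

```python
import io

COL_WIDTHS = [8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4]
SCALED_COLS = {2, 3}  # zero-based positions of columns 3 and 4

def parse_scaled(line):
    """Parse one record, dividing the scaled columns by 100 while parsing."""
    fields = []
    idx = 0
    for col, width in enumerate(COL_WIDTHS):
        value = int(line[idx:idx + width])
        if col in SCALED_COLS:
            value = value / 100.0  # e.g. 29767 becomes 297.67
        fields.append(value)
        idx += width
    return fields

def format_row(fields):
    """Join a row with single spaces; floats get two decimals (assumed format)."""
    return ' '.join('{:.2f}'.format(f) if isinstance(f, float) else str(f)
                    for f in fields)

def convert(src, dst):
    """Read records from src (any iterable of lines) and write rows to dst."""
    for line in src:
        if line.strip():  # skip blank lines
            dst.write(format_row(parse_scaled(line)) + '\n')

# Demo with the first two sample lines and a blank line in between.
src = io.StringIO(
    '820401001 825029767710821718 8 5 510-180090000 8   9   4\n'
    '\n'
    '820401001 8083 7970200661367 7 8 0 3-170090070 0  24   1\n'
)
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue(), end='')
```

With real files, src and dst would be file objects opened with open(...) in read and write mode respectively; convert only needs something to iterate over and something with a write method. If the matrix is needed in memory as well, the rows returned by parse_scaled can be appended to a list.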

Answer 2

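The numpy library is designed for exactly this, if you don't mind a relatively lightweight dependency that most people doing scientific computing already have installed.

Step 1:

Convert the fixed-width format into comma- or space-delimited text using cut, awk, and/or sed.

Step 2: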

import numpy as np
data = np.loadtxt('parsed.txt')

Adding the columns at indices 2 and 3 is as simple as:

output = data[:,2] + data[:,3]

Alternatively, you can use the function fromregex to do the parsing and the arrangement into a numpy array in one step.

Answer 3

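First, you have an ASCII data table, and a good tool to start with for that is Python's built-in string manipulation (more on this at the bottom). Also, your data isn't "massive", so you should first write the code so that it's easy, and then optimize if needed.

Here is a small program that does the parsing and is hopefully self-explanatory. I start from a string that defines the structure, simply because it is easy to define it that way.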

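structure = "8i 1i 5f 5f 5i 1i 2i 1i 2i 2i 2i 2i 2i 3i 3i 1i 1i 2i 4i 4i"
structure = structure.split()

result = []
with open("data.txt") as df:
    for line in df:  # iterate lazily instead of loading every line at once
        if not line.strip():
            continue  # skip blank lines
        n, vals = 0, []
        for s in structure:
            width = int(s[:-1])  # everything before the type letter
            val = int(line[n:n+width])
            if s[-1] == 'f':  # 'f' marks a column to divide by 100
                val = val/100.
            vals.append(val)
            n += width
        result.append(vals)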

This gives:

result = [
[82040100, 1, 82.5, 297.67, 71082, 1, 71, 8, 8, 5, 5, 10, -1, 800, 900, 0, 0, 8, 9, 4],
[82040100, 1, 80.83, 79.7, 20066, 1, 36, 7, 7, 8, 0, 3, -1, 700, 900, 7, 0, 0, 24, 1],
[82040100, 1, 80.82, 47.45, 20034, 1, -1, 8, 4, 9, 0, 3, 2, 400, 800, 4, 4, 0, -20, 2],
[82040100, 1, 80.62, 58.05, 20046, 1, 36, 7, 2, 9, 0, 3, 1, 200, 667, 2, 5, 0, -7, 2],
[82040100, 1, 80.37, 52.92, 20049, 1, -1, 7, 7, 7, 0, 3, -1, 700, 900, 7, 0, 0, -16, 2]]

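struct is primarily for parsing binary data, so although you could use it, it's not the best choice. Also, numpy's loadtxt needs some kind of delimiter, so without one it won't work (unless you pre-parse the data, which seems to defeat the purpose).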
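
If the simple version later turns out to be too slow for millions of lines, one possible optimization is to compute the column boundaries once instead of on every line. The sketch below is an assumption about what such a step could look like, not a measured improvement; SLICES, FLOAT_COLS and parse_fast are names invented for this example, and any gain should be confirmed by timing it on real data.

```python
from itertools import accumulate

WIDTHS = (8, 1, 5, 5, 5, 1, 2, 1, 2, 2, 2, 2, 2, 3, 3, 1, 1, 2, 4, 4)

# Column boundaries, computed once at import time.
ENDS = list(accumulate(WIDTHS))   # running totals: 8, 9, 14, ...
STARTS = [0] + ENDS[:-1]
SLICES = [slice(s, e) for s, e in zip(STARTS, ENDS)]

FLOAT_COLS = (2, 3)  # zero-based positions of the columns to divide by 100

def parse_fast(line):
    """Parse one record using the precomputed slices."""
    vals = [int(line[sl]) for sl in SLICES]
    for col in FLOAT_COLS:
        vals[col] = vals[col] / 100.0
    return vals

sample = '820401001 825029767710821718 8 5 510-180090000 8   9   4\n'
print(parse_fast(sample))
```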

Answer 4

You can do this with simple string processing. Here is a parser that should do what you want:

class Parser(object):
    def __init__(self, fd):
        self.widths = (8, 1, 3, 2, 3, 2, 5, 1, 2, 1, 2, 2, 2, 2, 2,
                       3, 3, 1, 1, 2, 4, 4)
        self.seps = '  . .                \n'
        self.fd = fd
    def parse(self, filename):
        with open(filename) as file:
            for line in file:
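                ix = 0
                out = ''
                for rank, j in enumerate(self.widths):
                    text = line[ix:ix+j]
                    # Normalise every field through int(), except the 2-digit
                    # fractional parts that follow a '.', which are kept verbatim
                    # so that '05' is not shortened to '5' (58.05, not 58.5).
                    if rank == 0 or self.seps[rank-1] != '.':
                        text = str(int(text))
                    out = out + text + self.seps[rank]
                    ix += j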
                self.fd.write(out)

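Create the parser with a file object (e.g. sys.stdout) as its argument; this file object will receive the output. Then call parse with the input file name as its argument: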

import sys
parser = Parser(sys.stdout)
parser.parse('input.dat')

If you want to write to another file:

with open('outfile.dat', 'w') as fd:
    parser = Parser(fd)
    parser.parse('input.dat')

Answer 5

Note: if you have GNU awk (which you will if you are using a GNU/Linux distribution), you can easily do this conversion from the bash prompt:

awk -vFIELDWIDTHS="8 1 5 5 5 1 2 1 2 2 2 2 2 3 3 1 1 2 4 4" \
    '{$3/=100;$4/=100;print}' \
    < input.dat > newfile.dat

If you want tab-separated fields in the output, add -vOFS='\t' before the -vFIELDWIDTHS=... assignment.

Parsing a huge .dat file in Python and performing arithmetic
