Question

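I've been struggling to get something working for the text file format below. My overall goal is to extract the values of one of the variable names across the whole text file. For example, I want all the values from the [b] and [d] lines. I would then put them into a regular numpy array and run calculations on them.

Here is what the data file looks like: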

[SECTION1a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION1b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48     204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208   104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255   110 
[END SECTION1]
[SECTION2a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION2b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48   204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110 
[END SECTION2]

This pattern continues for N sections.

Currently I read the file and split it into two columns:

import numpy as np
from easygui import fileopenbox  # assumption: fileopenbox comes from easygui

filename_load = fileopenbox(msg=None, title='Load Data File',
                            default=r"Z:\*",
                            filetypes=None)

col1_data = np.genfromtxt(filename_load, skip_header=1, dtype=None,
                          usecols=(0,), usemask=True, invalid_raise=False)

col2_data = np.genfromtxt(filename_load, skip_header=1, dtype=None,
                          usecols=(1,), usemask=True, invalid_raise=False)

I was then going to use where to find the indices of the values I want, and build a new array from those values:

arr_index = np.where(col1_data == '[b]')
new_array = col2_data[arr_index]

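The problem is that, because of the odd file format, I end up with two arrays of different sizes, so the data in the arrays obviously can't be matched to the correct variable names.

I've tried a few other alternatives and got stuck, both on the odd text file format and on how to read it into Python.

I'm not sure whether I should stay on this track and whether the problem can be solved this way, or whether I should try a completely different approach.

Thanks in advance!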

Answer 1

A possible solution that sorts the data into a hierarchy of OrderedDict() dictionaries:

from collections import OrderedDict
import re


ss = """[SECTION1a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION1b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48     204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208   104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255   110
[END SECTION1]
[SECTION2a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION2b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48   204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110
[END SECTION2]"""

# regular expressions for matching SECTIONs
p1 = re.compile(r"^\[SECTION[0-9]+a\]")
p2 = re.compile(r"^\[SECTION[0-9]+b\]")
p3 = re.compile(r"^\[END SECTION[0-9]+\]")

def parse(ss):
    """Make a hierarchical dict from a string."""
    ll, l_cnt = ss.splitlines(), 0
    d = OrderedDict()
    while l_cnt < len(ll):  # iterate through lines
        l = ll[l_cnt].strip()
        if p1.match(l):  # new sub dict for [SECTION*a]
            dd, nn = OrderedDict(), l[1:-1]
            l_cnt += 1
            # read "[key] value" lines until the next section marker
            while (l_cnt < len(ll) and
                   p2.match(ll[l_cnt].strip()) is None and
                   p3.match(ll[l_cnt].strip()) is None):
                ww = ll[l_cnt].split()
                if len(ww) >= 2:  # skip blank or malformed lines
                    dd[ww[0][1:-1]] = int(ww[1])
                l_cnt += 1
            d[nn] = dd
        elif p2.match(l):  # array of ints for [SECTION*b]
            d[l[1:-1]] = [int(w) for w in ll[l_cnt+1].split()]
            l_cnt += 2
        else:  # [END SECTION*] or any other line: skip it, so the loop always advances
            l_cnt += 1
    return d

dd = parse(ss)

Note that you could get more robust code by using an existing parsing tool (e.g. Parsley).

To retrieve '[c]' from all sections, do the following:

print("All entries for [c]: ", end="")
cc = [d['c'] for s,d in dd.items() if s.endswith('a')]
print(", ".join(["{}".format(c) for c in cc]))    
# Gives: All entries for [c]: 873348378938, 873348378938
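
Since the original goal was to end up with numpy arrays of the [b] and [d] values, the same comprehension can feed np.array directly. The following is a minimal sketch, not part of the answer's code: `parsed` is a hypothetical, shortened stand-in for the dict returned by parse(), and `column` is an illustrative helper name. It assumes every [SECTION*a] block contains the requested key (a missing key would raise KeyError).

```python
import numpy as np

# Hypothetical stand-in for the nested dict that parse() returns
parsed = {
    "SECTION1a": {"a": 1424457484310, "b": 5313402937, "d": 882992596992},
    "SECTION1b": [243, 62, 184],
    "SECTION2a": {"a": 1424457484310, "b": 5313402937, "d": 882992596992},
    "SECTION2b": [243, 62, 184],
}

def column(parsed, key):
    """Collect one variable from every [SECTION*a] block as an int64 array."""
    values = [sec[key] for name, sec in parsed.items() if name.endswith("a")]
    # int64 is requested explicitly because the values exceed the 32-bit range
    return np.array(values, dtype=np.int64)

b = column(parsed, "b")
d = column(parsed, "d")
print(d - b)  # element-wise calculation across sections
```

Because both arrays are built from the same sections in the same order, element i of `b` and element i of `d` always belong to the same section.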

Or you can traverse the whole dictionary:

def print_recdicts(d, tbw=0):
    """Print the hierarchical dict."""
    for k,v in d.items():
        if isinstance(v, OrderedDict):
            print(" "*tbw + "* {}:".format(k))
            print_recdicts(v, tbw+2)
        else:
            print(" "*tbw + "* {}: {}".format(k,v))

print_recdicts(dd)
# Gives:
# * SECTION1a:
#   * a: 1424457484310
#   * b: 5313402937
# ...
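
The [SECTION*b] entries are stored as plain lists of ints, so they can be stacked into a 2-D numpy array if that data is needed as well. This is only a sketch under assumptions: `parsed` is a hypothetical, shortened stand-in for the parsed dict, and every [SECTION*b] line is assumed to hold the same number of values (the code raises an error otherwise rather than guessing).

```python
import numpy as np

# Hypothetical, shortened stand-in for the dict built by the parser
parsed = {
    "SECTION1a": {"b": 5313402937},
    "SECTION1b": [243, 62, 184, 145],
    "SECTION2a": {"b": 5313402937},
    "SECTION2b": [243, 62, 184, 110],
}

# One row per [SECTION*b] entry, in file order
rows = [v for name, v in parsed.items() if name.endswith("b")]

# A rectangular array needs rows of equal length
if len({len(r) for r in rows}) != 1:
    raise ValueError("SECTION*b rows have different lengths")

block = np.array(rows)
print(block.shape)         # (number of sections, values per section)
print(block.mean(axis=1))  # one mean per section
```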

Answer 2

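The following should do it. It uses a running store (tally) to handle missing values, and writes that state out to the result lists whenever it hits an end marker.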

import re
import numpy as np

filename = "yourfilenamehere.txt"

# matches lines such as "[e] 14957596088"
match_line_re = re.compile(r"^\[([a-z])\]\s+(\d+)")

# one output list per variable of interest
result = {
    'b': [],
    'd': [],
}

def new_tally():
    """Fresh per-section store; keys not seen in a section stay NaN."""
    return dict.fromkeys(result, np.nan)

tally = new_tally()
with open(filename, 'r') as f:
    for line in f:
        if line.startswith('[END SECTION'):
            # Write accumulated data to the lists
            for k, v in tally.items():
                result[k].append(v)

            # start from a new dict; reusing the same one would carry values over
            tally = new_tally()

        else:
            # Map the items using regex
            m = match_line_re.search(line)
            if m:
                k, v = m.group(1), m.group(2)
                if k in tally:
                    tally[k] = int(v)

# float dtype so that missing values can be stored as NaN
b = np.array(result['b'], dtype=float)
d = np.array(result['d'], dtype=float)

Note that any key in the result dict definition will appear in the output.
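
Because a section that lacks one of the keys contributes NaN, any later calculation has to be NaN-aware. The sketch below assumes the collected values have been converted to float arrays; the sample numbers are hypothetical and stand for a file whose second section has no [d] line.

```python
import numpy as np

# Hypothetical arrays: section 2 had no [d] line, so its slot is NaN
b = np.array([5313402937, 5313402938], dtype=float)
d = np.array([882992596992, np.nan], dtype=float)

# Keep only the sections where both values are present
valid = ~np.isnan(b) & ~np.isnan(d)
diff = d[valid] - b[valid]

# NaN-aware reductions ignore the missing entries
mean_d = np.nanmean(d)
print(diff, mean_d)
```

The arrays keep one slot per section, so positions in `b` and `d` still line up even when a value is missing.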
