Question

I have a file in which each line contains a string like this:

Format = 1A Rnti = 65535 (SI-RNTI) Format0/Format1A Differentiation Flag = 1 Localised / Distributed VRB Assignment Flag = 0 Resource Block Assignment = 0x00000000 Resource Blocks Detail = (RBstart:0, Lcrbs:1, Ndlrb:50) Modulation and Coding Scheme = 5 Harq Process Number = 16 (Broadcast HARQ Process) New Data Indicator = 0 Redundancy Version = 0 TPC Command = 0 (-1 dB)

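I want to use the field names as keys, split on '=', and save the corresponding values. The field names are Format, Rnti, Format0/Format1A Differentiation Flag, Localised / Distributed VRB Assignment Flag, and so on. This is what I am trying: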

import csv

with open("log.txt","r") as log:
   reader = csv.reader(log,delimiter='=')
   csv3 = open("csv3.txt","w")
   for row in reader:
      print >> csv3,row

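But I cannot split the field names (keys) and the values correctly. Is there a way to define all the key names up front and then store the corresponding values from the string in a dictionary?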

Answer 1

Your best bet for parsing this monstrosity is to first identify all the field names and then capture the values between those fields.

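The simplest way is to find each field in the line and extract the value between the equals sign that follows it and the next field found. Something like: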

# List all fields here (if possible in order of appearance)
# Everything not listed will end up as a part of another detected field's value
FIELD_LIST = ["Format", "Rnti", "Format0/Format1A Differentiation Flag",
              "Localised / Distributed VRB Assignment Flag", "Resource Block Assignment",
              "Resource Blocks Detail", "Modulation and Coding Scheme",
              "Harq Process Number", "New Data Indicator", "Redundancy Version",
              "TPC Command"]

# let's put the logic for parsing one ugly log line into a function
def parse_ugly_log_line(log):
    field_indexes = {field: log.find(field) for field in FIELD_LIST}  # get field indexes
    field_order = sorted(field_indexes, key=field_indexes.get)  # sort indexes
    parsed_fields = {}  # store for our fields
    for i, field in enumerate(field_order):
        if field_indexes[field] == -1:  # field not found, skip
            continue
        field_start = log.find("=", field_indexes[field])  # value begins after `=`
        if field_start == -1:  # cannot find the field value, skip
            continue
        # field value ends where the next field begins:
        field_end = field_indexes[field_order[i + 1]] if i < len(field_order) - 1 else None
        if field_end and field_start > field_end:  # overlapping field value, skip
            continue
        parsed_fields[field] = log[field_start + 1:field_end].strip()  # extract the value
    return parsed_fields

# let's now open our log file and parse it line by line:
logs = []  # storage of the parsed data
with open("your_log.txt", "r") as f:
    for line in f:
        logs.append(parse_ugly_log_line(line))

# you can now access individual fields for each of the lines, e.g.:
print(logs[0]["Modulation and Coding Scheme"])  # prints: 5
print(logs[4]["Resource Block Assignment"])  # prints: 0x00000032

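You could use regex (something like (field1|field2|etc)\s*=(.*)(?!field1|field2|etc), capturing the two groups to get field, value tuples) to achieve a similar effect, but I'm not a fan of building super-long regex patterns, and regex engines were not designed for this kind of task anyway.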
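For comparison, here is a minimal sketch of what such a regex variant might look like. It is not the pattern quoted above: it uses a lazy value with a lookahead for the next "name =" instead. The helper name parse_with_regex and the shortened sample line are made up for illustration, and the field names are assumed to match the sample in the question.

```python
import re

# Field names taken from the sample line in the question.
FIELD_LIST = ["Format", "Rnti", "Format0/Format1A Differentiation Flag",
              "Localised / Distributed VRB Assignment Flag", "Resource Block Assignment",
              "Resource Blocks Detail", "Modulation and Coding Scheme",
              "Harq Process Number", "New Data Indicator", "Redundancy Version",
              "TPC Command"]

# Try longer names first so "Format0/Format1A Differentiation Flag"
# is not cut short at "Format".
NAMES = "|".join(re.escape(name) for name in sorted(FIELD_LIST, key=len, reverse=True))

# A field name, "=", then a lazy value that stops right before
# the next "name =" or at the end of the line.
FIELD_RE = re.compile("(" + NAMES + r")\s*=\s*(.*?)\s*(?=(?:" + NAMES + r")\s*=|$)")


def parse_with_regex(line):  # hypothetical helper name
    """Return a {field name: value} dict for one log line."""
    return dict(FIELD_RE.findall(line.strip()))


# Shortened version of the sample line from the question.
sample = ("Format = 1A Rnti = 65535 (SI-RNTI) Format0/Format1A Differentiation Flag = 1 "
          "Localised / Distributed VRB Assignment Flag = 0 "
          "Resource Block Assignment = 0x00000000")
parsed = parse_with_regex(sample)
print(parsed["Rnti"])  # prints: 65535 (SI-RNTI)
print(parsed["Resource Block Assignment"])  # prints: 0x00000000
```

Like the find-based version, this only recognises the names listed in FIELD_LIST; anything unlisted ends up inside the preceding field's value.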

Answer 2

I think the only way to deal with such a messy file is a regexp.

import re

YOUR_FILE = "your_log.txt"  # path to the log file (placeholder name)


def dicts_generator():
    """Generates dicts with data from YOUR_FILE"""

    # Defining search regexp.
    #
    # Note that the regexp here is in VERBOSE mode. This means whitespace is
    # ignored and comments are allowed, so literal spaces have to be escaped.
    # At the end of each parameter I used \s to make it more visible to you.
    #
    # In the regexp each line is a parameter. I did not understand what exactly
    # Rnti is, so I could be wrong about its exact definition.


    r=re.compile(r"""

        # (?P<format> .... )  - group named 'format'
        # it will be a dict key

        Format\ =\ (?P<format>.+?)\s+   

        Rnti\ =\ (?P<rnti>.+?)\s+

        Differentiation\ Flag\ =\ (?P<differentiation_flag>.+?)\s+

        # add other parameters here

        """, re.VERBOSE)

    # Read line after line and search in it.
    # Actually, it is OK to search the whole file at once,
    # but for me this way is clearer.
    with open(YOUR_FILE, mode="r") as f:
        for line in f:
            for m in r.finditer(line):
                yield m.groupdict()


for d in dicts_generator():
    print(d)   # do whatever you want with dict 'd'.

Prints:

{'format': '1A', 'differentiation_flag': '1', 'rnti': '65535 (SI-RNTI) Format0/Format1A'}
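In the output above the rnti group also picks up "Format0/Format1A", because the third pattern line starts at "Differentiation". A possible adjustment, sketched here on the assumption that the full field name is "Format0/Format1A Differentiation Flag" as in the sample line, is to spell that name out in the pattern. The names LINE_RE and sample are made up for this illustration.

```python
import re

# Same verbose-mode pattern, but the third line carries the full field name,
# so the lazy rnti group has to stop before "Format0/Format1A".
LINE_RE = re.compile(r"""
    Format\ =\ (?P<format>.+?)\s+
    Rnti\ =\ (?P<rnti>.+?)\s+
    Format0/Format1A\ Differentiation\ Flag\ =\ (?P<differentiation_flag>.+?)\s+
    """, re.VERBOSE)

# Shortened version of the sample line from the question.
sample = ("Format = 1A Rnti = 65535 (SI-RNTI) Format0/Format1A Differentiation Flag = 1 "
          "Localised / Distributed VRB Assignment Flag = 0")

m = LINE_RE.search(sample)
fields = m.groupdict()
print(fields["rnti"])  # prints: 65535 (SI-RNTI)
print(fields["differentiation_flag"])  # prints: 1
```

Each group still ends with \s+, so a field that is the very last thing on a line with no trailing whitespace would need a different terminator.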

How to parse a line and split it to save in a dictionary in Python
