Question

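I'm working on some code (in Python) that reads a text file. The text file contains the information needed to construct a particular geometry, separated into sections by keywords. For example, the file: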

*VERTICES
1 0 0 0
2 10 0 0
3 10 10 0
4 0 10 0
*EDGES
1 1 2
2 1 4
3 2 3
4 3 4

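contains the information for a square with vertices at (0,0), (0,10), (10,0), (10,10). The "*EDGES" part defines the connections between the vertices. The first number in each row is an ID number.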

Here is my problem: the information in the text file is not necessarily in order. Sometimes the "Vertices" section appears first, and sometimes the "Edges" section comes first. I have other keywords as well, so I'm trying to avoid repeating if statements to test whether each line has a new keyword.

What I have been doing is reading the text file multiple times, each time looking for a different keyword:

open file
read line by line
    if line == *Points
        store all the following lines in a list until a new *command is encountered
close file
open file (again)
read line by line
    if line == *Edges
        store all the following lines in a list until a new *command is encountered
close file
open file (again)
...

Can someone point out how I can identify these keywords without such a tedious procedure? Thanks.

Answer 1

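The fact that the sections are unordered lends itself, I think, to parsing them into a dictionary that you can access later. I wrote a function that may be useful for this task: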

from collections import defaultdict

features = ['POINTS', 'EDGES']

def parseFile(dictionary, f, features):
    """
    Creates a format where you can access a shape feature like:
        dictionary[shapeID][feature] = [[1, 1, 1], [1, 1, 1], ...]

    Assumes: all features, although out of order, occur in the order
        shape1
            *feature1
                .
                .
                .
            *featuren
    Assumes all possible features are in the list features

    dictionary must create missing entries on access (see below)
    f is input file handle
    """
    shapeID = 0
    found = set()
    feature = None
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines

        if line[0] == '*':
            if found == set(features):
                # every feature has been seen, so this header starts a new shape
                found = set()
                shapeID += 1
            feature = line[1:]  # current feature, like POINTS
            found.add(feature)

        elif feature is not None:
            dictionary[shapeID][feature].append(
                [int(i) for i in line.split()]
                )

    return dictionary

# a nested defaultdict creates dictionary[shapeID][feature] on first use
dictionary = defaultdict(lambda: defaultdict(list))
with open('data.txt') as f:  # placeholder file name
    parseFile(dictionary, f, features)
shapeID = 0  # the first shape in the file

# to access the shape features you can get vertices like:

for vertex in dictionary[shapeID]['POINTS']:
    print(vertex)

# to access edges

for edge in dictionary[shapeID]['EDGES']:
    print(edge)

Answer 2

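You should just create a dictionary of the sections. You can use a generator to read the file and yield each section in the order it arrives, then build a dictionary from the results. Here is some incomplete code that may help you: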

def load(f):
    with open(f) as file:
        section = next(file).strip()  # Assumes first line is always a section
        data = []
        for line in file:
            if line[0] == '*':        # Any appropriate test for a new section
                yield section, data
                section = line.strip()
                data = []
            else:
                data.append(list(map(int, line.strip().split())))
        yield section, data

Assuming the data above is in a file named data.txt:

>>> data = dict(load('data.txt'))
>>> data
{'*EDGES': [[1, 1, 2], [2, 1, 4], [3, 2, 3], [4, 3, 4]],
 '*VERTICES': [[1, 0, 0, 0], [2, 10, 0, 0], [3, 10, 10, 0], [4, 0, 10, 0]]}

Then you can refer to each section, for example:

for edge in data['*EDGES']:
    ...
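
Once the sections are in a dictionary, the order they had in the file no longer matters when you combine them. As a rough sketch of that, the snippet below resolves each edge to the coordinates of its two endpoints. It assumes each vertex row is an ID followed by coordinates and each edge row is an edge ID followed by two vertex IDs (which is what the question suggests); the names vertices and edges are mine.

```python
# Parsed result copied from the interactive session above.
data = {
    '*EDGES': [[1, 1, 2], [2, 1, 4], [3, 2, 3], [4, 3, 4]],
    '*VERTICES': [[1, 0, 0, 0], [2, 10, 0, 0], [3, 10, 10, 0], [4, 0, 10, 0]],
}

# Map vertex ID -> coordinates (assumed layout: ID, then coordinates).
vertices = {row[0]: tuple(row[1:]) for row in data['*VERTICES']}

# Map edge ID -> (start coordinates, end coordinates)
# (assumed layout: edge ID, start vertex ID, end vertex ID).
edges = {row[0]: (vertices[row[1]], vertices[row[2]]) for row in data['*EDGES']}

for edge_id, (start, end) in sorted(edges.items()):
    print(edge_id, start, end)
```

A vertex ID that appears in an edge but not in the vertex section would raise a KeyError here, which is probably what you want for a malformed file.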

Answer 3

Assuming your file is named 'data.txt':

from collections import defaultdict

def get_data():
    d = defaultdict(list)
    with open('data.txt') as f:
        key = None
        for line in f:
            if line.startswith('*'):
                key = line.rstrip()
                continue
            d[key].append(line.rstrip())
    return d

The returned defaultdict looks like this:

defaultdict(list,
            {'*EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4'],
             '*VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0']})

You can access the data just like a normal dictionary:

d['*EDGES']
['1 1 2', '2 1 4', '3 2 3', '4 3 4']
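
Because the rows are kept as strings, you will probably want numbers before doing any geometry with them. A minimal sketch of that conversion, assuming every field is an integer as in the sample data (int() raises ValueError on something like 2.5); the name parsed is mine:

```python
from collections import defaultdict

# Same contents as the defaultdict shown above.
d = defaultdict(list, {
    '*EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4'],
    '*VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0'],
})

# Split every row on whitespace and convert each field to int.
parsed = {
    key: [[int(field) for field in row.split()] for row in rows]
    for key, rows in d.items()
}

print(parsed['*EDGES'])
```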

Answer 4

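You can read the file once and store the contents in a dictionary. Since you have conveniently marked the "command" lines with a *, you can use every line that starts with * as a dictionary key and all the lines that follow it as the values for that key. You can do this with a for loop: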

with open('geometry.txt') as f:
    x = {}
    key = None  # store the most recent "command" here
    for y in f:
        if y[0] == '*':
            key = y[1:].strip()  # your "command", without the trailing newline
            x[key] = []
        else:
            x[key].append(y.split())  # add subsequent lines to the most recent key

Or you can take advantage of Python's list and dictionary comprehensions to do the same thing in one line:

with open('test.txt') as f:
    x = {y.split('\n')[0]:[z.split() for z in y.strip().split('\n')[1:]] for y in f.read().split('*')[1:]}

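I'll admit it doesn't look very nice, but it gets the job done by splitting the entire file into chunks between the '*' characters, and then using newlines and spaces as delimiters to split each chunk into a dictionary key and a list of lists (the dictionary value).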

For more details on splitting, stripping, and slicing strings, see here.
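
If the one-liner is hard to follow, here is roughly the same logic unrolled into separate steps. It runs on an in-memory string instead of a file so it can be tried directly; the sample text is an assumption pieced together from the data shown in this thread.

```python
# Sample text (assumed; stands in for f.read()).
text = (
    "*VERTICES\n1 0 0 0\n2 10 0 0\n3 10 10 0\n4 0 10 0\n"
    "*EDGES\n1 1 2\n2 1 4\n3 2 3\n4 3 4\n"
)

x = {}
# Splitting on '*' gives one chunk per section. The first chunk is whatever
# comes before the first '*' (an empty string here), so it is skipped.
for chunk in text.split('*')[1:]:
    lines = chunk.strip().split('\n')
    key = lines[0]                                 # e.g. 'VERTICES'
    x[key] = [line.split() for line in lines[1:]]  # remaining lines -> lists of fields

print(x['EDGES'])
```

Note that this approach breaks if a '*' can appear anywhere other than at the start of a section line.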

Answer 5

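A common strategy for this kind of parsing is to build a function that yields the data one section at a time. Your top-level calling code can then be very simple, because it doesn't have to worry about the section logic at all. Here's an example using your data: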

import sys

def main(file_path):
    # An example usage.
    for section_name, rows in sections(file_path):
        print('===============')
        print(section_name)
        for row in rows:
            print(row)

def sections(file_path):
    # Setup.
    section_name = None
    rows = []

    # Process the file.
    with open(file_path) as fh:
        for line in fh:
            # Section start: yield any rows we have so far,
            # and then update the section name.
            if line.startswith('*'):
                if rows:
                    yield (section_name, rows)
                    rows = []
                section_name = line[1:].strip()
            # Otherwise, just add another row.
            else:
                row = line.split()
                rows.append(row)

    # Don't forget the last batch of rows.
    if rows:
        yield (section_name, rows)

main(sys.argv[1])
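
One thing a generator makes easy is deciding later what to do when a keyword shows up more than once. The sketch below is a variant of the function above that takes any iterable of lines rather than a path (my change, so it can be tried without a file) and merges repeated sections into one list. Whether your files ever repeat a keyword is an assumption on my part; sections_from_lines, sample and merged are illustrative names.

```python
from collections import defaultdict

def sections_from_lines(lines):
    # Same logic as sections() above, but over an iterable of lines.
    section_name = None
    rows = []
    for line in lines:
        if line.startswith('*'):
            if rows:
                yield section_name, rows
                rows = []
            section_name = line[1:].strip()
        elif line.strip():  # ignore blank lines
            rows.append(line.split())
    if rows:
        yield section_name, rows

# Hypothetical input in which *EDGES appears twice.
sample = ['*EDGES\n', '1 1 2\n', '*VERTICES\n', '1 0 0 0\n', '*EDGES\n', '2 1 4\n']

merged = defaultdict(list)
for name, rows in sections_from_lines(sample):
    merged[name].extend(rows)  # append to any rows already stored under this name

print(dict(merged))
```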

Answer 6

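Given that your data is not ordered, a dictionary is probably your best option. After reading the file into a list, you can access it by section name. Note that the with keyword closes your file automatically.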

Here is what it might look like:

# read the data file into a simple list, dropping the trailing newlines:
with open('file.dat') as f:
    lines = f.read().splitlines()

# get the line numbers for each section:
section_line_nos = [line for line, data in enumerate(lines) if data.startswith('*')]
# add a terminating line number to mark end of the file:
section_line_nos.append(len(lines))

# split each section off into a new list, all contained in a dictionary
# with the section names as keys
section_dict = {
    lines[section_line_no][1:]: lines[section_line_no + 1: section_line_nos[section_no + 1]]
    for section_no, section_line_no in enumerate(section_line_nos[:-1])
}

You will get a dictionary like this:

{'VERTICES': ['1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0'], 'EDGES': ['1 1 2', '2 1 4', '3 2 3', '4 3 4']}

Access each section this way:

section_dict['EDGES']

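Note that the code above assumes that every section starts with * and that no other lines start with *. If that is not the case, you can match the section names explicitly instead: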

section_names = ['*EDGES', '*VERTICES']
section_line_nos = [line for line, data in enumerate(lines) if data.strip() in section_names]

Also note this part of the section_dict code:

lines[section_line_no][1:]

...gets rid of the star at the start of each section name. If you don't want that, you can change it to:

lines[section_line_no]

If there may be unwanted whitespace in your section name lines, you can do this to get rid of it:

lines[section_line_no].strip()[1:]

I haven't tested all of this, but that's the general idea.
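
For what it's worth, the slicing idea can be checked on an in-memory list without touching a file. This is only a sketch: the list below stands in for the file's lines with newlines already removed, and pairing each start index with the next one via zip is my variation on the enumerate lookup used above.

```python
# Stand-in for the file contents, one entry per line, newlines removed.
lines = [
    '*VERTICES', '1 0 0 0', '2 10 0 0', '3 10 10 0', '4 0 10 0',
    '*EDGES', '1 1 2', '2 1 4', '3 2 3', '4 3 4',
]

# Index of every section header line.
starts = [i for i, text in enumerate(lines) if text.startswith('*')]
# Each section ends where the next one starts; the last one ends at end of file.
ends = starts[1:] + [len(lines)]

# Header (minus the star) -> the lines between this header and the next.
section_dict = {lines[s][1:]: lines[s + 1:e] for s, e in zip(starts, ends)}

print(section_dict['EDGES'])
```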

How to read and organize text files divided by keywords
