Question

I'm trying to extract specific lines from a file into variables.

This is the content of my test.txt

#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.

#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm

#third set
Task Identification Number: 210CT3
Task title: Final Examination
Weight: 50
fullMark: 100
Description: Close Book Examination

This is my code

with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ") 
        print(taskNumber)
        print(taskTile)
        print(weight)
        print(fullMark)
        print(description)

This is what I want to do:

taskNumber is 210CT1 
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time

and loop until the third set

But I get an error in the output

ValueError: not enough values to unpack (expected 5, got 2)

In response to SwiftsNamesake

I tried your code. I still get an error.

ValueError: too many values to unpack (expected 5)

This is my attempt at using your code

from itertools import zip_longest

def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


with open(home + '\\Desktop\\PADS Assignment\\210CT.txt', 'r') as mod:
    for group in chunks(mod.readlines(), 5+2, fillvalue=''):
        # Choose the item after the colon, excluding the extraneous rows
        # that don't have one.
        # You could probably find a more elegant way of achieving the same thing
        l = [item.split(': ')[1].strip() for item in group if ':' in item]
        taskNumber , taskTile , weight, fullMark , desc = l
        print(taskNumber , taskTile , weight, fullMark , desc, sep='|')

Answer 1

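As already mentioned, you need some kind of chunking. To chunk usefully, we also need to ignore the irrelevant lines of the file. I've implemented this below with some nice Python witchcraft.

Using a namedtuple to store the values might also suit you. A namedtuple is a very simple type of object that just stores a number of different values; for example, a point in 2D space might be a namedtuple with x and y fields. This is the example given in the Python documentation. Refer to that documentation if you want more information about namedtuples and their uses. I've taken the liberty of creating a Task class with the fields ["number", "title", "weight", "fullMark", "desc"].

Since your variables are all properties of a task, using a named tuple probably makes sense for brevity and clarity.

Apart from that, I've tried to stick to your approach of splitting on the colon. My code produces the output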

================================================================================
number is 210CT1
title is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
================================================================================
number is 210CT2
title is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
================================================================================
number is 210CT3
title is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination

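This seems to be roughly what you're after; I'm not sure how strict your output requirements are. It should be relatively easy to modify to that end, though.

Here is my code, with some explanatory comments: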

from collections import namedtuple

#defines a simple class 'Task' which stores the given properties of a task
Task = namedtuple("Task", ["number", "title", "weight", "fullMark", "desc"])

#chunk a file (or any iterable) into groups of n (as an iterable of n-tuples)
def n_lines(n, read_file):
    return zip(*[iter(read_file)] * n)

#used to strip out empty lines and lines beginning with #, as those don't appear to contain any information
def line_is_relevant(line):
    return line.strip() and line[0] != '#'

with open("input.txt") as in_file:
    #filters the file for relevant lines, and then chunks into 5 lines
    for task_lines in n_lines(5, filter(line_is_relevant, in_file)):
        #for each line of the task, strip it, split it by the colon and take the second element
        #(ie the remainder of the string after the colon), and build a Task from this
        task = Task(*(line.strip().split(": ")[1] for line in task_lines))
        #just to separate each parsed task
        print("=" * 80)
        #iterate over the field names and values in the task, and print them
        for name, value in task._asdict().items():
            print("{} is {}".format(name, value))

You can also refer to each field of the task individually, like this:

            print("The number is {}".format(task.number))

If you don't want the namedtuple approach, feel free to replace the contents of the main for loop with

        taskNumber, taskTitle, weight, fullMark, desc = (line.strip().split(": ")[1] for line in task_lines)

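and then continue with your code as normal.

Some notes on the other changes I've made:

filter does what it says on the tin: it only iterates over the lines that satisfy the predicate (those for which line_is_relevant(line) evaluates as true).

The * in the Task instantiation unpacks the iterator, so each parsed line becomes a separate argument to the Task constructor.

The expression (line.strip().split(": ")[1] for line in task_lines) is a generator. It is needed because task_lines holds several lines at once, so for each line in our 'chunk' we strip it, split it on the colon and take the second element, which is the value.

The n_lines function works by passing a list of n references to the same iterator to the zip function (see the zip documentation). zip then tries to yield the next element from each element of that list, but since each of the n elements is the same iterator over the file, zip yields n lines of the file. This continues until the iterator is exhausted.

The line_is_relevant function uses the concept of 'truthiness'. A more verbose way of implementing it might be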

def line_is_relevant(line):
    return len(line.strip()) > 0 and line[0] != '#'

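However, in Python every object can be used implicitly in a boolean expression. In such an expression an empty string ("") acts as False and a non-empty string acts as True, so, conveniently, if line.strip() is empty it acts as False and line_is_relevant returns a falsy value. The and operator also short-circuits when its first operand is falsy, which means the second operand is not evaluated, so the reference to line[0] cannot raise an IndexError.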
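Both effects can be checked by running the predicate over a few strings. This is only a small sketch: the sample strings are made up for illustration, and bool() is applied purely to make the truthy or falsy result visible.

```python
def line_is_relevant(line):
    # An empty stripped string is falsy, so `and` short-circuits and
    # line[0] is never evaluated for blank input.
    return line.strip() and line[0] != '#'

# Hypothetical sample lines: a comment, a blank line, an empty string
# and a real "key: value" line.
samples = ["#first set\n", "\n", "", "Weight: 25\n"]
results = [bool(line_is_relevant(s)) for s in samples]
print(results)
```

Note that the empty string does not raise an IndexError, because line[0] is never reached for it.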

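OK, here is my attempt at a more extensive explanation of the n_lines function:

First, the zip function lets you iterate over more than one 'iterable' at once. An iterable is something like a list or a file that you can go through in a for loop, so the zip function lets you do things like this: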

>>> for i in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print(i)
... 
('foo', 1)
('bar', 4)
('baz', 9)

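The zip function returns a 'tuple' holding one element from each list at a time. A tuple is basically a list, except that it is immutable, so you can't change it; zip doesn't expect you to change any of the values it gives you, only to do something with them. Apart from that, a tuple can be used pretty much like a normal list. Now, a useful trick is to use 'unpacking' to separate each of the bits of the tuple, like so: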

>>> for a, b in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print("a is {} and b is {}".format(a, b))  
... 
a is foo and b is 1
a is bar and b is 4
a is baz and b is 9

A simpler unpacking example, which you may have seen before (Python also lets you omit the parentheses () here):

>>> a, b = (1, 2)
>>> a
1
>>> b
2

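The n_lines function doesn't use this feature, though. Now, zip can also take more than two arguments: you can zip three, four or as many iterables as you like (pretty much).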

>>> for i in zip([1, 2, 3], [0.5, -2, 9], ["cat", "dog", "apple"], "ABC"):
...     print(i)
... 
(1, 0.5, 'cat', 'A')
(2, -2, 'dog', 'B')
(3, 9, 'apple', 'C')

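Now, the n_lines function passes *[iter(read_file)] * n to zip. There are a couple of things to cover here; I'll start with the second part. Note that the first * has lower precedence than everything after it, so this is equivalent to *([iter(read_file)] * n). What iter(read_file) does is construct an iterator object from read_file by calling iter on it. An iterator is a bit like a list, except that you can't index it, as in it[0]. All you can do is iterate over it, as in a for loop. The code then builds a list of length 1 with this iterator as its only element, and then 'multiplies' this list by n.

In Python, using the * operator with a list concatenates it to itself n times. This makes sense if you think about it, since + is the concatenation operator. So, for example,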

>>> [1, 2, 3] * 3 == [1, 2, 3] + [1, 2, 3] + [1, 2, 3] == [1, 2, 3, 1, 2, 3, 1, 2, 3]
True

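Incidentally, this uses Python's chained comparison operators: a == b == c is equivalent to a == b and b == c, except that b only needs to be evaluated once, which shouldn't matter 99% of the time.

Anyway, we now know that the * operator copies a list n times. It also has one more property: it doesn't build any new objects. This can be a bit of a gotcha: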

>>> l = [object()] * 3
>>> id(l[0])
139954667810976
>>> id(l[1])
139954667810976
>>> id(l[2])
139954667810976

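There are three objects here, but they are all in fact the same object (you might think of this as three 'pointers' to the same object). If you build a list of more complex objects, such as lists, and perform an in-place operation such as sorting on one of them, it affects every element of the list.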

>>> l = [ [3, 2, 1] ] * 4
>>> l
[[3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1]]
>>> l[0].sort()
>>> l
[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]

So [iter(read_file)] * n is equivalent to

it = iter(read_file)
l = [it, it, it, it... n times]

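Now the very first *, the one with the low precedence, 'unpacks' this list again, but this time it doesn't assign it to variables; it assigns it to the arguments of zip. This means zip receives each element of the list as a separate argument, instead of just the one argument that is the list. Here is an example of how unpacking works in a simpler case:

>>> def f(a, b):
...     print(a + b)
...
>>> f([1, 2]) #doesn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(*[1, 2]) #works just like f(1, 2)
3

So in effect, we now have something like

it = iter(read_file)
return zip(it, it, it... n times)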

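Remember that when you 'iterate' over a file object in a for loop, you go through each line of the file. So when zip tries to 'iterate' over each of its n objects once, it draws one line from each object; but because every object is the same iterator, that line is 'consumed' and the next line it draws is the next line of the file. One 'round' of iterating over each of its n arguments yields n lines, which is what we want.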
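The same behaviour can be observed without a file. The sketch below applies n_lines to a plain list of made-up strings (an illustrative stand-in for the file's lines):

```python
def n_lines(n, iterable):
    # n references to ONE iterator: each zip step pulls n consecutive items.
    return zip(*[iter(iterable)] * n)

# Seven hypothetical "lines", grouped three at a time.
lines = ["a", "b", "c", "d", "e", "f", "g"]
groups = list(n_lines(3, lines))
print(groups)  # [('a', 'b', 'c'), ('d', 'e', 'f')]
```

Because zip stops as soon as any argument is exhausted, the incomplete trailing group ('g') is silently dropped. That is fine for this file, where the filtered lines come in exact multiples of five.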

Answer 2

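Your line variable only gets Task Identification Number: 210CT1 as its first input. You're trying to extract 5 values from it by splitting on :, but there are only 2 values there.

What you want is to divide your for loop into groups of 5, read each group as 5 lines, and split each line on :.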
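One way to sketch that idea is shown below. It is not taken from any of the answers here: read_tasks is a hypothetical helper name, and io.StringIO with a shortened copy of the question's data stands in for the real file so the snippet runs on its own.

```python
import io

# Shortened copy of the question's data (an assumption for illustration).
SAMPLE = """#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.

#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
"""

def read_tasks(file_obj, size=5):
    # Keep only "key: value" lines; comments and blank lines are skipped.
    values = [line.split(": ", 1)[1].strip()
              for line in file_obj if ": " in line]
    # Slice the flat list of values into consecutive groups of `size`.
    return [values[i:i + size] for i in range(0, len(values), size)]

tasks = read_tasks(io.StringIO(SAMPLE))
for taskNumber, taskTitle, weight, fullMark, desc in tasks:
    print(taskNumber, taskTitle, weight, fullMark, desc, sep=" | ")
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by the file object returned by open.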

Answer 3

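You're trying to get more data than is present on a single line; the five pieces of data are on separate lines.

As SwiftsNamesake suggested, you can use itertools to group the lines: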

import itertools

def keyfunc(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True

    # The separator is an empty line between the data sets, so it returns
    # true when it finds this line.
    return line == "\n"

with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for k, g in itertools.groupby(mod, keyfunc):
        if not k: # Does not process lines that are separators.
            for line in g:
                data = line.strip().partition(": ")
                print(f"{data[0]} is {data[2]}")
                # print(data[0] + " is " + data[2]) # If python < 3.6

            print("") # Prints a newline to separate groups at the end of each group.

If you want to use the data in other functions, yield it as a dictionary from a generator:

from collections import OrderedDict
import itertools

def isSeparator(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True

    # The separator is an empty line between the data sets, so it returns
    # true when it finds this line.
    return line == "\n"

def parseData(data):
    for line in data:
        k, s, v = line.strip().partition(": ")
        yield k, v

def readData(filePath):
    with open(filePath, "r") as mod:
        for key, g in itertools.groupby(mod, isSeparator):
            if not key: # Does not process lines that are separators.
                yield OrderedDict((k, v) for k, v in parseData(g))

def printData(data):
    for d in data:
        for k, v in d.items():
            print(f"{k} is {v}")
            # print(k + " is " + v) # If python < 3.6

        print("") # Prints a newline to separate groups at the end of each group.

data = readData(home + '\\Desktop\\PADS Assignment\\test.txt')
printData(data)
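To see why the check on the group key is enough to isolate each data set, it can help to look at the runs that itertools.groupby produces. The sketch below uses a short made-up list of lines instead of the file, and is_separator is a condensed stand-in for the key functions above.

```python
import itertools

# Hypothetical excerpt of the file, as the lines a file object would yield.
lines = ["#first set\n", "Weight: 25\n", "fullMark: 100\n",
         "\n", "#second set\n", "Weight: 50\n"]

def is_separator(line):
    # Comment lines and blank lines both count as separators.
    return line.startswith("#") or line == "\n"

# groupby yields one (key, group) pair per run of consecutive lines
# that share the same key value.
runs = [(key, [line.strip() for line in group])
        for key, group in itertools.groupby(lines, is_separator)]
for key, group in runs:
    print(key, group)
```

Adjacent separator lines (the blank line and the following comment) collapse into a single run with key True, so every run with key False is exactly one data set.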

Answer 4

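As another poster (@Cuber) has already said, you are looping line by line, whereas each data set is spread over five lines. The error message is basically saying that you're trying to unpack five values when all you have is two. Furthermore, it looks like you're only interested in the value to the right of the colon, so you really only have one value.

There are several ways to solve this, but the simplest is probably to group the data into fives (plus the padding, which makes it seven) and process each group in one go.

First we define chunks, which turns this somewhat fiddly process into one elegant loop (taken from the itertools docs).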

from itertools import zip_longest

def chunks(iterable, n, fillvalue=None):
  args = [iter(iterable)] * n
  return zip_longest(*args, fillvalue=fillvalue)

Now we use it with your data. I've omitted the file boilerplate.

for group in chunks(mod.readlines(), 5+2, fillvalue=''):
  # Choose the item after the colon, excluding the extraneous rows
  # that don't have one.
  # You could probably find a more elegant way of achieving the same thing
  l = [item.split(': ')[1].strip() for item in group if ':' in item]
  taskNumber , taskTile , weight, fullMark , desc = l
  print(taskNumber , taskTile , weight, fullMark , desc, sep='|')

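The 2 in 5+2 is for the padding (the comment above and the blank line below).

The implementation of chunks may not make sense to you at the moment. If so, I'd suggest looking into Python generators (and the itertools documentation in particular, which is a marvellous resource). It's also a good idea to get your hands dirty and tinker with snippets in the Python REPL.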
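As a starting point for that kind of tinkering, here is a small sketch that runs chunks on a short string instead of a file. The input "abcdefg" and the "-" fill value are arbitrary choices for illustration.

```python
from itertools import zip_longest

def chunks(iterable, n, fillvalue=None):
    # n references to the same iterator, so each output tuple
    # takes n consecutive items from the input.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

result = list(chunks("abcdefg", 3, fillvalue="-"))
print(result)  # [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', '-', '-')]
```

Unlike plain zip, zip_longest keeps the incomplete final group and pads it with the fill value. This is why the loop above passes fillvalue='' and filters on ':' afterwards.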

Answer 5

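You can still read line by line, but you have to help the code understand what it is parsing. We can use an OrderedDict to look up the corresponding variable names.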

import os
import collections as ct


def printer(dict_, lookup):
    for k, v in lookup.items():
        print("{} is {}".format(v, dict_[k]))
    print()


names = ct.OrderedDict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark","fullMark"),
    ("Description", "desc"),
])

filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    for line in f.readlines():
        line = line.strip()
        if line.startswith("#"):
            header = line
            d = {}
            continue
        elif line:
            k, v = line.split(":")
            d[k] = v.strip(" ")
        else:
            printer(d, names)
    printer(d, names)

Output

taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.

taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm

taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination

Answer 6

The problem here is that you are splitting line by line: each line contains only one : so you get 2 values. In this line:

taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")

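you're telling it that there are 5 values, but it only finds 2, so it gives you an error.

One way to fix this is to check for each value separately inside the loop, since you're not allowed to change the format of the file. I would use the first word of each line and sort the data into different lists: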

import re
Identification=[]
title=[]
weight=[]
fullmark=[]
Description=[]
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        list_of_line=re.findall(r'\w+', line)
        if len(list_of_line)==0:
            pass
        else:
            if list_of_line[0]=='Task':
                if list_of_line[1]=='Identification':
                    Identification.append(line[28:-1])
                if list_of_line[1]=='title':
                    title.append(line[12:-1])
            if list_of_line[0]=='Weight':
                weight.append(line[8:-1])
            if list_of_line[0]=='fullMark':
                fullmark.append(line[10:-1])
            if list_of_line[0]=='Description':
                Description.append(line[13:-1])


print('taskNumber is %s' % Identification[0])
print('taskTitle is %s' % title[0])
print('Weight is %s' % weight[0])
print('fullMark is %s' %fullmark[0])
print('desc is %s' %Description[0])
print('\n')
print('taskNumber is %s' % Identification[1])
print('taskTitle is %s' % title[1])
print('Weight is %s' % weight[1])
print('fullMark is %s' %fullmark[1])
print('desc is %s' %Description[1])
print('\n')
print('taskNumber is %s' % Identification[2])
print('taskTitle is %s' % title[2])
print('Weight is %s' % weight[2])
print('fullMark is %s' %fullmark[2])
print('desc is %s' %Description[2])
print('\n')

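Of course you could use a loop for the printing, but I was too lazy, so I copy-pasted :). If you need any help or have any questions, please ask! This code assumes you're not advanced at coding. Good luck!!!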
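For reference, the loop version of the printing could look something like the sketch below. The five lists are filled with the values from the question's test.txt so the snippet runs on its own; the labels list and the output list are names introduced here for illustration.

```python
# Sample values copied from the question's test.txt; in the code above
# these lists are filled while reading the file.
Identification = ["210CT1", "210CT2", "210CT3"]
title = ["Assignment 1", "Assignment 2", "Final Examination"]
weight = ["25", "25", "50"]
fullmark = ["100", "100", "100"]
Description = [
    "Program and design and complexity running time.",
    "Shortest Path Algorithm",
    "Close Book Examination",
]

labels = ["taskNumber", "taskTitle", "Weight", "fullMark", "desc"]
output = []
# zip walks the five lists in parallel, yielding one task per iteration.
for task in zip(Identification, title, weight, fullmark, Description):
    for label, value in zip(labels, task):
        output.append("%s is %s" % (label, value))
    output.append("")  # blank line between tasks

print("\n".join(output))
```

This also keeps working if the file later contains more than three sets, which the copy-pasted print statements would not.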

Answer 7

Inspired by the itertools-based solutions, here is another one that uses the more_itertools.grouper tool from the more-itertools library. It behaves similarly to @SwiftsNamesake's chunks function.

import collections as ct

import more_itertools as mit


names = dict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark","fullMark"),
    ("Description", "desc"),
])


filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    lines = (line.strip() for line in f.readlines())
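    # Recent more-itertools releases take the iterable first;
    # older releases used mit.grouper(7, lines).
    for group in mit.grouper(lines, 7):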
        for line in group[1:]:
            if not line: continue
            k, v = line.split(":")
            print("{} is {}".format(names[k], v.strip()))
        print()

Output

taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.

taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm

taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination

Note that the variable names are printed with their corresponding values.

Python: how to extract specific strings into multiple variables

7 answers: