I am trying to extract specific lines from a file into variables.
Here is the content of my test.txt:
#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.
#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
#third set
Task Identification Number: 210CT3
Task title: Final Examination
Weight: 50
fullMark: 100
Description: Close Book Examination
This is my code:
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
        print(taskNumber)
        print(taskTile)
        print(weight)
        print(fullMark)
        print(description)
This is what I am trying to do:
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time
and loop until the third set
But I get an error in the output:
ValueError: not enough values to unpack (expected 5, got 2)
In response to SwiftsNamesake:
I tried your code. I still get an error:
ValueError: too many values to unpack (expected 5)
This is my attempt at using your code:
from itertools import zip_longest
def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
with open(home + '\\Desktop\\PADS Assignment\\210CT.txt', 'r') as mod:
    for group in chunks(mod.readlines(), 5+2, fillvalue=''):
        # Choose the item after the colon, excluding the extraneous rows
        # that don't have one.
        # You could probably find a more elegant way of achieving the same thing
        l = [item.split(': ')[1].strip() for item in group if ':' in item]
        taskNumber , taskTile , weight, fullMark , desc = l
        print(taskNumber , taskTile , weight, fullMark , desc, sep='|')
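Answer 0 (score: 2)
As mentioned before, you need some sort of chunking. To chunk usefully, we also need to ignore the irrelevant lines of the file. I have implemented this below with some nice Python witchcraft.
It might also suit you to use a namedtuple to store the values. A namedtuple is a very simple type of object that just stores a number of different values; for example, a point in 2D space might be a namedtuple with x and y fields. This is the example given in the Python documentation, which you should refer to for more information about namedtuples and their uses, if you wish. I have taken the liberty of creating a Task class with the fields ["number", "title", "weight", "fullMark", "desc"].
Since your variables are all properties of a task, using a named tuple might make sense in the interest of brevity and clarity.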
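As a minimal sketch of the namedtuple idea described above, here is the 2D point example. The Point name and its x and y fields are purely illustrative and are not part of the solution that follows.

```python
from collections import namedtuple

# Illustrative type: a point in 2D space with fields x and y
Point = namedtuple("Point", ["x", "y"])

p = Point(x=3, y=4)

# Fields can be read by name or by position
print(p.x, p.y)
print(p[0], p[1])

# A namedtuple still unpacks like an ordinary tuple
x, y = p

# _asdict() maps field names to values, which is handy for printing
fields = p._asdict()
for name, value in fields.items():
    print("{} is {}".format(name, value))
```

The same pattern applies to a task: five named fields instead of two.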
Aside from that, I have tried to stick to your approach of splitting on the colon. My code produces the output
================================================================================
number is 210CT1
title is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
================================================================================
number is 210CT2
title is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
================================================================================
number is 210CT3
title is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
This seems to be roughly what you are after; I am not sure how strict your output requirements are. It should be relatively easy to modify to that end, though.
Here is my code, with some explanatory comments:
from collections import namedtuple
#defines a simple class 'Task' which stores the given properties of a task
Task = namedtuple("Task", ["number", "title", "weight", "fullMark", "desc"])
#chunk a file (or any iterable) into groups of n (as an iterable of n-tuples)
def n_lines(n, read_file):
    return zip(*[iter(read_file)] * n)
#used to strip out empty lines and lines beginning with #, as those don't appear to contain any information
def line_is_relevant(line):
    return line.strip() and line[0] != '#'
with open("input.txt") as in_file:
    #filters the file for relevant lines, and then chunks into 5 lines
    for task_lines in n_lines(5, filter(line_is_relevant, in_file)):
        #for each line of the task, strip it, split it by the colon and take the second element
        #(ie the remainder of the string after the colon), and build a Task from this
        task = Task(*(line.strip().split(": ")[1] for line in task_lines))
        #just to separate each parsed task
        print("=" * 80)
        #iterate over the field names and values in the task, and print them
        for name, value in task._asdict().items():
            print("{} is {}".format(name, value))
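You can also refer to each field of the task, like this:
print("The number is {}".format(task.number))
If the namedtuple approach is not desired, feel free to replace the contents of the main for loop with
taskNumber, taskTitle, weight, fullMark, desc = (line.strip().split(": ")[1] for line in task_lines)
and then your code will carry on as normal.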
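Some notes on the other changes I have made:
filter does what it says on the tin, only iterating over lines that meet the predicate (those for which line_is_relevant(line) is truthy).
The * in the Task instantiation unpacks the iterator, so each parsed line is an argument to the Task constructor.
The expression (line.strip().split(": ")[1] for line in task_lines) is a generator. This is needed because we are handling multiple lines at once with task_lines, so for each line in our chunk we strip it, split it by the colon and take the second element, which is the value.
The n_lines function works by passing a list of n references to the same iterator to the zip function (see the documentation). zip then tries to yield the next element from each element of this list, but since each of the n elements is an iterator over the file, zip yields n lines of the file. This continues until the iterator is exhausted.
The line_is_relevant function uses the idea of "truthiness". A more verbose way to implement it might be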
def line_is_relevant(line):
    return len(line.strip()) > 0 and line[0] != '#'
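However, in Python, every object can implicitly be used in boolean logic expressions. An empty string ("") in such an expression acts as False, and a non-empty string acts as True, so conveniently, if line.strip() is empty it will act as False and line_is_relevant will therefore evaluate as false. The and operator will also short-circuit if the first operand is falsy, which means the second operand is not evaluated and therefore, conveniently, the reference to line[0] will not cause an IndexError.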
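The truthiness and short-circuit behavior described above can be checked with a small sketch. The sample list is hypothetical and merely stands in for lines read from a file.

```python
def line_is_relevant(line):
    # An empty or whitespace-only line strips to "", which is falsy,
    # so `and` short-circuits and line[0] is never evaluated
    return line.strip() and line[0] != '#'

# Hypothetical sample lines, standing in for lines read from the file
sample = ["#first set\n", "Weight: 25\n", "\n", "", "fullMark: 100\n"]

# filter keeps only the lines for which the predicate is truthy
kept = list(filter(line_is_relevant, sample))
print(kept)

# The function returns a truthy or falsy value, not always a bool:
# for the empty string it returns "" itself, without raising IndexError
empty_result = line_is_relevant("")
print(repr(empty_result))
```

Because filter only tests truthiness, returning "" instead of False makes no difference here.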
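OK, here is my attempt at a more extended explanation of the n_lines function:
Firstly, the zip function lets you iterate over more than one 'iterable' at once. An iterable is something like a list or a file, which you can go over in a for loop, so the zip function lets you do something like this:
>>> for i in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print(i)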
...
('foo', 1)
('bar', 4)
('baz', 9)
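The zip function returns a 'tuple' of one element from each list at a time. A tuple is basically a list, except that it is immutable, so you cannot change it; zip does not expect you to change any of the values it gives you, but to do something with them. Other than that, a tuple can be used pretty much like a normal list. Now, a useful trick here is using 'unpacking' to separate each of the bits of the tuple, like this:
>>> for a, b in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print("a is {} and b is {}".format(a, b))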
...
a is foo and b is 1
a is bar and b is 4
a is baz and b is 9
A simpler unpacking example, which you may have seen before (Python also lets you omit the parentheses () here):
>>> a, b = (1, 2)
>>> a
1
>>> b
2
The n_lines function does not use this, though. Now, zip can also work with more than two arguments: you can zip three, four or as many lists as you like (pretty much).
>>> for i in zip([1, 2, 3], [0.5, -2, 9], ["cat", "dog", "apple"], "ABC"):
...     print(i)
...
(1, 0.5, 'cat', 'A')
(2, -2, 'dog', 'B')
(3, 9, 'apple', 'C')
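Now the n_lines function passes *[iter(read_file)] * n to zip. There are a couple of things to cover here; I will start with the second part. Note that the first * has lower precedence than everything after it, so it is equivalent to *([iter(read_file)] * n). Now, what iter(read_file) does is construct an iterator object from read_file by calling iter on it. An iterator is kind of like a list, except that you cannot index it, as in it[0]. All you can do is iterate over it, like going over it in a for loop. It then builds a list of length 1 with this iterator as its only element. It then 'multiplies' this list by n.
In Python, using the * operator with a list concatenates it to itself n times. If you think about it, this kind of makes sense, as + is the concatenation operator. So, for example,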
>>> [1, 2, 3] * 3 == [1, 2, 3] + [1, 2, 3] + [1, 2, 3] == [1, 2, 3, 1, 2, 3, 1, 2, 3]
True
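Incidentally, this uses Python's chained comparison operators: a == b == c is equivalent to a == b and b == c, except that b only has to be evaluated once, which should not matter 99% of the time.
Anyway, we now know that the * operator copies a list n times. It also has one more property: it does not build any new objects. This can be a bit of a gotcha: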
>>> l = [object()] * 3
>>> id(l[0])
139954667810976
>>> id(l[1])
139954667810976
>>> id(l[2])
139954667810976
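Here there are three objects in l, but they are all in reality the same object (you might think of this as three 'pointers' to the same object). If you were to build a list of more complex objects, such as lists, and perform an in-place operation like sorting one of them, it would affect all elements of the list.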
>>> l = [ [3, 2, 1] ] * 4
>>> l
[[3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1]]
>>> l[0].sort()
>>> l
[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
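For contrast, here is a small sketch of how to get independent elements when the sharing shown above is not wanted. The variable names shared and independent are illustrative only.

```python
# Repeating a list copies references, so all four entries are one list
shared = [[3, 2, 1]] * 4
shared[0].sort()

# A list comprehension evaluates its expression once per element,
# so each entry is a separate list object
independent = [[3, 2, 1] for _ in range(4)]
independent[0].sort()

print(shared)       # [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(independent)  # [[1, 2, 3], [3, 2, 1], [3, 2, 1], [3, 2, 1]]
```

For n_lines, though, the sharing is exactly the behavior that is wanted.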
So [iter(read_file)] * n is equivalent to
it = iter(read_file)
l = [it, it, it, it... n times]
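Now for the first *, the one with the low precedence: this 'unpacks' the list again, but this time it does not assign it to variables, but to the arguments of zip. This means zip receives each element of the list as a separate argument, instead of just one argument that is the list. Here is an example of how unpacking works in a simpler case:
>>> def f(a, b):
...     print(a + b)
...
>>> f([1, 2]) #doesn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(*[1, 2]) #works just like f(1, 2)
3
So in effect, now we have something like
it = iter(read_file)
return zip(it, it, it... n times)
Remember that when you 'iterate' over a file object in a for loop, you iterate over each line of the file, so when zip tries to 'go over' each of the n objects at once, it draws one line from each object. But because each object is the same iterator, this line is 'consumed' and the next line it draws is the next line from the file. One 'round' of iteration over each of its n arguments yields n lines, which is what we want.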
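To make the 'consuming' behavior concrete, the following sketch writes out by hand what zip does in each round and compares it with n_lines. The lines list is a hypothetical stand-in for the relevant lines of a file.

```python
def n_lines(n, read_file):
    # n references to the SAME iterator are passed to zip
    return zip(*[iter(read_file)] * n)

# Hypothetical stand-in for the relevant lines of a file
lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n"]

# What zip does in one round, written out by hand: it asks each of its
# three arguments for one item, and all three are the same iterator
it = iter(lines)
first_round = (next(it), next(it), next(it))
second_round = (next(it), next(it), next(it))

groups = list(n_lines(3, lines))
print(groups)
```

Each round advances the single shared iterator three times, so consecutive tuples never overlap.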
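Answer 1 (score: 1)
Your line variable gets only Task Identification Number: 210CT1 as its first input. You are trying to extract 5 values from it by splitting on :, but there are only 2 values there.
What you want is to divide your for loop into groups of 5, read each group as 5 lines, and split each line on :.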
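One possible way to sketch this idea uses itertools.islice to pull five lines at a time. The read_tasks helper and the in-memory text are assumptions made for illustration; the sketch also assumes that every data line contains ": " and that comment and blank lines do not.

```python
from itertools import islice

# Hypothetical in-memory stand-in for test.txt (two sets only, for brevity)
text = """#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.
#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
"""

def read_tasks(lines, size=5):
    # Keep only "key: value" lines, dropping comments and blank lines
    relevant = (ln.strip() for ln in lines if ": " in ln)
    while True:
        group = list(islice(relevant, size))
        if len(group) < size:
            return  # end of data; an incomplete trailing group is dropped
        # partition splits on the first ": " only and keeps the rest intact
        yield tuple(ln.partition(": ")[2] for ln in group)

tasks = list(read_tasks(text.splitlines()))
for taskNumber, taskTitle, weight, fullMark, desc in tasks:
    print(taskNumber, taskTitle, weight, fullMark, desc, sep="|")
```

With a real file, the open file object could be passed in place of text.splitlines().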
Answer 2 (score: 0)
You are trying to get more data than is present on one line; the five pieces of data are on separate lines.
As SwiftsNamesake suggested, you can use itertools to group the lines:
import itertools
def keyfunc(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True
    # The separator is an empty line between the data sets, so it returns
    # true when it finds this line.
    return line == "\n"
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for k, g in itertools.groupby(mod, keyfunc):
        if not k: # Does not process lines that are separators.
            for line in g:
                data = line.strip().partition(": ")
                print(f"{data[0]} is {data[2]}")
                # print(data[0] + " is " + data[2]) # If python < 3.6
            print("") # Prints a newline to separate groups at the end of each group.
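To see what itertools.groupby produces with a key function of this kind, here is a small sketch on an in-memory list. The is_separator name and the sample lines are illustrative assumptions.

```python
import itertools

def is_separator(line):
    # Comment lines and blank lines separate the data sets
    return line.startswith("#") or line == "\n"

# Hypothetical stand-in for the lines of the data file
lines = [
    "#first set\n",
    "Weight: 25\n",
    "fullMark: 100\n",
    "\n",
    "#second set\n",
    "Weight: 50\n",
]

# groupby yields one (key, group) pair per run of consecutive lines
# that share the same key value
groups = [(key, list(group)) for key, group in itertools.groupby(lines, is_separator)]
for key, group in groups:
    print(key, group)
```

Adjacent separator lines fall into a single run, so the blank line and the following comment come out as one group with key True.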
If you want to use the data in other functions, output it as a dictionary from a generator:
from collections import OrderedDict
import itertools
def isSeparator(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True
    # The separator is an empty line between the data sets, so it returns
    # true when it finds this line.
    return line == "\n"
def parseData(data):
    for line in data:
        k, s, v = line.strip().partition(": ")
        yield k, v
def readData(filePath):
    with open(filePath, "r") as mod:
        for key, g in itertools.groupby(mod, isSeparator):
            if not key: # Does not process lines that are separators.
                yield OrderedDict((k, v) for k, v in parseData(g))
def printData(data):
    for d in data:
        for k, v in d.items():
            print(f"{k} is {v}")
            # print(k + " is " + v) # If python < 3.6
        print("") # Prints a newline to separate groups at the end of each group.
data = readData(home + '\\Desktop\\PADS Assignment\\test.txt')
printData(data)
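Answer 3 (score: 0)
As another poster (@Cuber) has already said, you are looping over the file line by line, whereas the data sets are split across five lines. The error message is essentially saying that you are trying to unpack five values when all you have is two. Furthermore, it looks like you are only interested in the value on the right-hand side of the colon, so you really only have one value.
There are multiple ways of resolving this, but the simplest is probably to group the data into fives (plus the padding, making it seven) and process it in one go.
First we define chunks, with which we will turn this somewhat fiddly process into one elegant loop (from the itertools docs).
from itertools import zip_longest
def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
Now, we use it with your data. I have omitted the file boilerplate.
for group in chunks(mod.readlines(), 5+2, fillvalue=''):
    # Choose the item after the colon, excluding the extraneous rows
    # that don't have one.
    # You could probably find a more elegant way of achieving the same thing
    l = [item.split(': ')[1].strip() for item in group if ':' in item]
    taskNumber , taskTile , weight, fullMark , desc = l
    print(taskNumber , taskTile , weight, fullMark , desc, sep='|')
The 2 in 5+2 is for the padding (the comment above and the empty line below).
The implementation of chunks may not make sense to you at the moment. If so, I would suggest looking into Python generators (and the itertools documentation in particular, which is a marvellous resource). It is also a good idea to get your hands dirty and tinker with snippets inside the Python REPL.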
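In the spirit of tinkering, here is a small sketch of what the fillvalue does when the input does not divide evenly. The items list is hypothetical and unrelated to the task file.

```python
from itertools import zip_longest

def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Hypothetical data: 7 items do not divide evenly into groups of 3
items = list("abcdefg")

# zip_longest pads the incomplete last group with the fill value
padded = list(chunks(items, 3, fillvalue=""))
print(padded)
# [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', '', '')]

# Plain zip would instead stop early and drop the incomplete group
truncated = list(zip(*[iter(items)] * 3))
print(truncated)
# [('a', 'b', 'c'), ('d', 'e', 'f')]
```

This is why the loop above filters on ':' before unpacking: padded entries and separator rows carry no value.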
Answer 4 (score: 0)
You can still read line by line, but you have to help the code understand what it is parsing. We can use an OrderedDict to look up the appropriate variable name.
import os
import collections as ct
def printer(dict_, lookup):
    for k, v in lookup.items():
        print("{} is {}".format(v, dict_[k]))
    print()
names = ct.OrderedDict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark","fullMark"),
    ("Description", "desc"),
])
filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    for line in f.readlines():
        line = line.strip()
        if line.startswith("#"):
            header = line
            d = {}
            continue
        elif line:
            k, v = line.split(":")
            d[k] = v.strip(" ")
        else:
            printer(d, names)
    printer(d, names)
Output
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
Answer 5 (score: 0)
The problem here is that you are splitting line by line: each line contains only one ":", so there are 2 values. In this line:
taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
you are telling it that there are 5 values, but it only finds 2, so it gives you an error.
One way to fix this would be to run multiple for loops, one for each value, since you are not allowed to change the format of the file. I would use the first word and sort the data into different lists:
import re
Identification=[]
title=[]
weight=[]
fullmark=[]
Description=[]
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        list_of_line=re.findall(r'\w+', line)
        if len(list_of_line)==0:
            pass
        else:
            if list_of_line[0]=='Task':
                if list_of_line[1]=='Identification':
                    Identification.append(line[28:-1])
                if list_of_line[1]=='title':
                    title.append(line[12:-1])
            if list_of_line[0]=='Weight':
                weight.append(line[8:-1])
            if list_of_line[0]=='fullMark':
                fullmark.append(line[10:-1])
            if list_of_line[0]=='Description':
                Description.append(line[13:-1])
print('taskNumber is %s' % Identification[0])
print('taskTitle is %s' % title[0])
print('Weight is %s' % weight[0])
print('fullMark is %s' %fullmark[0])
print('desc is %s' %Description[0])
print('\n')
print('taskNumber is %s' % Identification[1])
print('taskTitle is %s' % title[1])
print('Weight is %s' % weight[1])
print('fullMark is %s' %fullmark[1])
print('desc is %s' %Description[1])
print('\n')
print('taskNumber is %s' % Identification[2])
print('taskTitle is %s' % title[2])
print('Weight is %s' % weight[2])
print('fullMark is %s' %fullmark[2])
print('desc is %s' %Description[2])
print('\n')
Of course you could use a loop for the printing, but I was lazy so I copied and pasted :). If you need any help or have any questions, please ask! This code assumes that you are not advanced in coding. Good luck!!!
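For reference, the loop version of the printing could look roughly like the sketch below. The list contents are hypothetical sample values standing in for what the parsing code above collects, and labels and report are names introduced only for this illustration.

```python
# Hypothetical sample values, standing in for the lists built by the parsing code
Identification = ["210CT1", "210CT2", "210CT3"]
title = ["Assignment 1", "Assignment 2", "Final Examination"]
weight = ["25", "25", "50"]
fullmark = ["100", "100", "100"]
Description = [
    "Program and design and complexity running time.",
    "Shortest Path Algorithm",
    "Close Book Examination",
]

labels = ("taskNumber", "taskTitle", "Weight", "fullMark", "desc")
report = []
# zip walks the five lists in parallel, giving one tuple of values per task
for values in zip(Identification, title, weight, fullmark, Description):
    for label, value in zip(labels, values):
        report.append("%s is %s" % (label, value))
    report.append("")  # blank line between sets

print("\n".join(report))
```

This also keeps working if the file later contains more than three sets.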
Answer 6 (score: 0)
Inspired by the itertools-related solutions, here is another one using the more_itertools.grouper tool from the more-itertools library. It behaves similarly to @SwiftsNamesake's chunks function.
import collections as ct
import more_itertools as mit
names = dict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark","fullMark"),
    ("Description", "desc"),
])
filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    lines = (line.strip() for line in f.readlines())
    # Recent more-itertools releases take the iterable first;
    # older releases used grouper(7, lines).
    for group in mit.grouper(lines, 7):
        for line in group[1:]:
            if not line: continue
            k, v = line.split(":")
            print("{} is {}".format(names[k], v.strip()))
        print()
Output
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
Note that the variable names are printed with the corresponding values.