I have a text file in roughly the following format (steps.txt):
This is the first line of the file.
here we tell you to make a tea.
step 1
Pour more than enough water for a cup of tea into a regular pot, and bring it to a boil.
step
2
This will prevent the steeping water from dropping in temperature as soon as it is poured in.
step 3
When using tea bags, the measuring has already been done for you - generally it's one tea bag per cup.
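I am trying to get the steps into a dictionary, e.g. steps_dic['step 1'] = 'Pour more than enough water for a cup of tea into a regular pot, and bring it to a boil.' and so on. **Sometimes the step number is on the next line.** I am reading the file and have written a wrapper around the iterator in Python so that my code can parse the lines and check hasnext().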
def step_check(self, line, prev):
    if line:
        self.reg1 = re.match(r'^step(\d|\s\d)', line)
        if self.reg1:
            self._reg1 = self.reg1.group()
            # print("in reg1: {} ".format(self._reg1))
    if line and prev:
        self.only_step = re.match(r'^step$', prev)
        if self.only_step:
            self._only_step = self.only_step.group()
            # print("int only step : {} ".format(self._only_step))
        self.only_digit = re.match(r'\d', line)
        if self.only_digit:
            self._only_digit = self.only_digit.group()
            # print("in only digit: {} ".format(self._only_digit))
    if self._reg1:
        self.step = self._reg1
        # print("Returning.. {} ".format(self.step))
        return self.step
    if self._only_step:
        if self._only_digit:
            # print("Only Step : {} ".format(self._only_step))
            # print ("Only Digit: {} ".format(self._only_digit))
            self.step = self._only_step + " " + self._only_digit
            # print("Returning.. {} ".format(self.step))
            return self.step
    else:
        # print("Returning.. {} ".format(self.step))
        return self.step
with open(file_name, 'r', encoding='utf-8') as f:
    self.steps_dict = dict()
    self.lines = hn_wrapper(f.readlines())  # wrapper code not included
    self.prev, self.line = None, self.lines.next()
    self.first_line = self.line
    self.prev, self.line = self.line, self.lines.next()
    try:
        while self.lines.hasnext():
            self.prev, self.line = self.line, self.lines.next()
            print(self.line)
            self.step_name = self.step_check(self.line, self.prev)
            if self.step_name:
                self.steps_dict[self.step_name] = ''
                self.prev, self.line = self.line, self.lines.next()
                while not self.step_check(self.line, self.prev):
                    self.steps_dict[self.step_name] = self.steps_dict[self.step_name] + self.line + "\n"
                    self.prev, self.line = self.line, self.lines.next()
    except StopIteration:  # assumed: the post omits the except clause
        pass
I only get steps_dic['step 1'] = ... and steps_dic['step 3'] = ...; step 2 is missed. I need to extract steps_dic['step 2'] as well. I can't work out how to look ahead in the text buffer.
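The hn_wrapper class is not shown in the question. For readers following the loop above, here is a minimal sketch of what such a wrapper might look like, assuming it only needs to expose next() and hasnext() over a list of lines. The implementation is hypothetical and not the asker's code.

```python
class hn_wrapper:
    """Hypothetical look-ahead wrapper; the asker's real class is not shown."""

    _EMPTY = object()  # marker meaning "nothing prefetched"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._peek = self._EMPTY

    def hasnext(self):
        # Prefetch one item so the question can be answered without losing it.
        if self._peek is self._EMPTY:
            self._peek = next(self._it, self._EMPTY)
        return self._peek is not self._EMPTY

    def next(self):
        # Hand out the prefetched item, or raise once the input is exhausted.
        if not self.hasnext():
            raise StopIteration
        item, self._peek = self._peek, self._EMPTY
        return item


lines = hn_wrapper(["step 1\n", "Boil water.\n"])
collected = []
while lines.hasnext():
    collected.append(lines.next().strip())
print(collected)
```

With a wrapper like this, calling next() inside the inner loop without first checking hasnext() raises StopIteration at the end of the file, which is why the outer try block matters.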
Answer 0 (score: 4)
You can read the whole file into memory and then run
re.findall(r'^step\s*(\d+)\s*(.*?)\s*(?=^step\s*\d|\Z)', text, re.DOTALL | re.MULTILINE)
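See the regex demo.
Details
^ - start of a line
step - the word step
\s* - zero or more whitespace characters
(\d+) - Group 1: one or more digits
\s* - zero or more whitespace characters
(.*?) - Group 2: any zero or more characters, as few as possible
\s* - zero or more whitespace characters
(?=^step\s*\d|\Z) - immediately to the right, there must be either
    ^step\s*\d - start of a line, step, zero or more whitespace characters and a digit
    | - or
    \Z - the end of the whole string.
A quick Python demo: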
import re
text = "This is the first line of the file.\nhere we tell you to make a tea.\n\nstep 1\n\nPour more than enough water for a cup of tea into a regular pot, and bring it to a boil.\n\nstep \n2\n\nThis will prevent the steeping water from dropping in temperature as soon as it is poured in.\n\nstep 3 \n\n\nWhen using tea bags, the measuring has already been done for you - generally it's one tea bag per cup."
results = re.findall(r'^step\s*(\d+)\s*(.*?)\s*(?=^step\s*\d|\Z)', text, re.DOTALL | re.MULTILINE)
print(dict([("step{}".format(x),y) for x,y in results]))
Output:
{'step2': 'This will prevent the steeping water from dropping in temperature as soon as it is poured in.', 'step1': 'Pour more than enough water for a cup of tea into a regular pot, and bring it to a boil.', 'step3': "When using tea bags, the measuring has already been done for you - generally it's one tea bag per cup."}
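The demo above builds keys such as 'step2', while the question asks for keys like 'step 1'. The following sketch uses the same pattern, written with re.VERBOSE so each part carries a comment, and formats the keys with a space. The names STEP_RE, parse_steps and sample are made up for illustration; for a real file you would pass in the result of f.read().

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE for readability.
STEP_RE = re.compile(
    r"""
    ^step\s*(\d+)        # header: step, optional whitespace or newline, number
    \s*(.*?)\s*          # body: matched lazily, surrounding whitespace dropped
    (?=^step\s*\d|\Z)    # stop before the next header or at the end of the text
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)


def parse_steps(text):
    """Return {'step N': body} for every step found in text."""
    return {"step {}".format(num): body for num, body in STEP_RE.findall(text)}


sample = (
    "This is the first line of the file.\n"
    "step 1\nBoil the water.\n"
    "step\n2\nWarm the pot.\n"
    "step 3\nOne tea bag per cup.\n"
)
steps = parse_steps(sample)
print(steps)
```

On Python 3.7 and later the dictionary keeps the order in which the steps appear in the text.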
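Answer 1 (score: 0)
Edited to handle the asker's revised requirement that the step number may appear on the following line.
You should be able to adapt this to achieve your goal. It basically leaves regular expressions out of the picture. It also loads the file only one line at a time (not that it matters much in this case).
You may run into problems if there is text unrelated to the steps at the bottom of the file, but the code can be adapted to handle that. Another problem arises if your steps reach 100. However, if you can rely on any line that starts with the word "step" (case-insensitive) being a step, you can remove the helper function and the right half of the condition check under the line iterator.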
def is_int(s):
    # Assumed helper: the answer refers to it but does not show its definition.
    try:
        int(s)
        return True
    except ValueError:
        return False

with open('text.txt') as f:
    last_key = False
    key = False
    check_next = False
    step_str = False
    my_dict = dict()
    for line in f:
        if line.strip(' \n').lower() == 'step':
            # Bare "step" line: the number is expected on a following line.
            check_next = True
            step_str = line.strip()
        elif line.lstrip().lower().startswith('step') and not check_next:
            if is_int(line[-2:]) and not is_int(line.strip()):
                if key:
                    my_dict[key] = val
                    last_key = key
                    key = line.strip()
                else:
                    key = line.strip()
                    val = ''
        elif check_next and all(s == '\n' for s in line.strip()):
            continue
        elif is_int(line.strip()) and check_next:
            my_dict[key] = val
            last_key = key
            key = '{} {}'.format(step_str, line.strip())
            check_next = False
        elif key:
            val += line
    if key != last_key:
        my_dict[key] = val
Result:
{'step 1': '\nPour more than enough water for a cup of tea into a regular pot, and bring it to a boil.\n\n', 'step 2': '\nPour more than enough water for a cup of tea into a regular pot, and bring it to a boil.\n\n\n This will prevent the steeping water from dropping in temperature as soon as it is poured in.\n\n', 'step 3': "\nPour more than enough water for a cup of tea into a regular pot, and bring it to a boil.\n\n\n This will prevent the steeping water from dropping in temperature as soon as it is poured in.\n\n\n\nWhen using tea bags, the measuring has already been done for you - generally it's one tea bag per cup."}
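In the result above, 'step 2' and 'step 3' still contain the text of the earlier steps, because val is only reset when the first step is found. Below is a minimal regex-free sketch that starts a fresh value for every step. The function name parse_steps and the sample lines are invented for illustration, and it assumes a step header is either "step N" on one line or a bare "step" followed by the number on a later line.

```python
def parse_steps(lines):
    """Collect {'step N': text} from an iterable of lines, without regex."""
    steps = {}
    key = None        # name of the step currently being filled
    pending = False   # True after a bare "step" line, waiting for its number
    for line in lines:
        stripped = line.strip()
        low = stripped.lower()
        # Text after the word "step", or None if the line is not a header.
        rest = low[4:].strip() if low.startswith("step") else None
        if pending and not stripped:
            continue  # blank line between "step" and its number
        if pending and stripped.isdigit():
            key, pending = "step " + stripped, False
            steps[key] = ""  # fresh value, so earlier text does not leak in
        elif rest == "":
            pending = True
        elif rest is not None and rest.isdigit():
            key, pending = "step " + rest, False
            steps[key] = ""
        else:
            pending = False
            if key is not None:
                steps[key] += line
    return {name: text.strip() for name, text in steps.items()}


sample = [
    "This is the first line of the file.\n",
    "step 1\n", "\n", "Boil the water.\n",
    "step \n", "\n", "2\n", "Warm the pot.\n",
    "step 3\n", "One tea bag per cup.\n",
]
result = parse_steps(sample)
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which keeps the one-line-at-a-time behaviour of the answer.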