I am trying to use the re module to search a fairly large file for strings. The file I am searching has the following format:
220
BOX 1, STEP 1
C 15.1760586379 13.7666285127 4.1579861659
F 13.7752750995 13.3845518556 4.1992254467
F 15.1122807811 15.0753387163 3.8457966464
H 15.5298304628 13.5873563855 5.1615910859
H 15.6594416869 13.1246597008 3.3754112615
5
BOX 2, STEP 1
C 15.1760586379 13.7666285127 4.1579861659
F 13.7752750995 13.3845518556 4.1992254467
F 15.1122807811 15.0753387163 3.8457966464
H 15.5298304628 13.5873563855 5.1615910859
H 15.6594416869 13.1246597008 3.3754112615
240
BOX 1, STEP 2
C 12.6851133069 2.8636250164 1.1788963097
F 11.7935769268 1.7912366066 1.3042188034
F 13.7887138736 2.3739304018 0.4126088380
H 12.1153838312 3.7024696077 0.7164304431
H 13.0962656950 3.1549047758 2.1436863477
C 12.6745394723 3.6338848332 15.1374252921
F 11.8703828307 4.3473226569 16.0480492173
F 12.2304604843 2.3709059503 14.9433964493
H 12.6002811971 4.1968554204 14.1449118786
H 13.7469256153 3.6086212350 15.5204655285
This format continues for box 1 and box 2 at every STEP, for about 30,000 steps in total. I have some code that uses the re module to search this file for the keyword "STEP". Unfortunately, it produces no results when I run it. I need my code to 1) search ONLY box 1, then 2) print/output to a file all the coordinates that follow STEP 1 (preferably omitting the "C", "F" and "H" labels and keeping only the coordinates), and 3) increase the "STEP" number by 48 and repeat 2). I also want to ignore the "5" and "240" lines in the file I am searching, so the code should make sure they are not included in the output. Here is what I have so far (it does not work):
import re
shakes = open("mc_coordinates", "r")
i = 1
for line in shakes:
    if re.match("(.*)STEP i(.*)", line):
        print line
        i+=48
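Note that this is a condensed version; there are usually about 250 lines of coordinates between consecutive "STEP" numbers. Any thoughts or ideas would be greatly appreciated. Thanks!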
Answer 0 (score: 0)
A quick but effective approach is to parse the file line by line and keep a little state.
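# a sketch to convey the idea; adjust the patterns to your real file
import re
# header lines look like "BOX 1, STEP 49": group 1 is the box, group 2 the step
header = re.compile(r"\s*BOX\s+(\d+),\s*STEP\s+(\d+)")
i = 1  # next STEP number that should be output
output = False  # are we in a block that should be output?
with open("mc_coordinates", "r") as shakes:
    for line in shakes:
        m = header.match(line)
        if m:
            # only BOX 1 at the wanted STEP switches output on;
            # any other box or step switches it off
            output = m.group(1) == "1" and int(m.group(2)) == i
            if output:
                print(line.rstrip())
                i += 48
        elif output and len(line.split()) == 4:
            # coordinate line: drop the leading C, F or H label;
            # count lines such as "5" or "240" have a single field and are skipped
            print(" ".join(line.split()[1:]))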
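It may help to see why the pattern in the question never matches: inside the string "(.*)STEP i(.*)", the i is the literal letter i, not the value of the variable i. The short sketch below builds the pattern from the number instead. The helper name step_pattern and its box parameter are made up for this illustration and are not part of the original code.

```python
import re

def step_pattern(step, box=1):
    """Compile a pattern for the header line of one box and step (hypothetical helper)."""
    # str.format substitutes the numbers into the pattern text;
    # the trailing \s*$ keeps STEP 1 from also matching STEP 10.
    return re.compile(r"\s*BOX\s+{},\s*STEP\s+{}\s*$".format(box, step))

i = 1
literal = re.compile("(.*)STEP i(.*)")  # looks for the letter "i", not the variable

print(bool(literal.match("BOX 1, STEP 1")))           # False
print(bool(step_pattern(i).match("BOX 1, STEP 1")))   # True
print(bool(step_pattern(i).match("BOX 1, STEP 10")))  # False
print(bool(step_pattern(i).match("BOX 2, STEP 1")))   # False
```

Compiling a new pattern for every step is fine for a sketch; for a 30,000-step file it is probably cheaper to compile one pattern that captures the step number and compare it as an integer.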
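Answer 1 (score: 0)
The simplest approach seems to be to use two regular-expression patterns: 1. one that finds the 'BOX 1, STEP 48N + 1' strings; 2. one that extracts the coordinates.
I have provided some code below. I haven't tried it on your data, but any bugs should be easy to fix. Basically, what you need is a small state machine that tells you when the coordinates should be printed.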
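import re

# Pattern 1: header lines of box 1; group 1 captures the STEP number.
step_re = re.compile(r'BOX 1,\s+STEP (\d+)')
# Pattern 2: three decimal numbers (the x, y, z coordinates).
coord_re = re.compile(r'\s+(-?\d+\.\d+)' * 3)
in_step = False  # state: are we inside a block that should be printed?
with open('your_file.txt') as f:
    for line in f:
        if in_step:
            coord_match = coord_re.search(line)
            if coord_match:
                print(' '.join(coord_match.groups()))
                continue
            # a non-coordinate line (e.g. "5" or the next header) ends the block
            in_step = False
        step_match = step_re.search(line)
        if step_match and int(step_match.group(1)) % 48 == 1:
            print('STEP {}'.format(step_match.group(1)))
            in_step = True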
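For anyone who wants to check the state machine without the real file, here is a minimal sketch that wraps the same idea in a generator and runs it on a shortened copy of the sample from the question. It is written for Python 3, and the names extract_coordinates, first_step and stride are assumptions chosen for this example, not something from the original post.

```python
import io
import re

HEADER_RE = re.compile(r"\s*BOX\s+(\d+),\s*STEP\s+(\d+)")
# an element label (C, F, H, ...) followed by exactly three decimal numbers
COORD_RE = re.compile(r"\s*[A-Za-z]+((?:\s+-?\d+\.\d+){3})\s*$")

def extract_coordinates(lines, box=1, first_step=1, stride=48):
    """Yield (step, [x, y, z]) for each coordinate line of the wanted blocks."""
    current_step = None  # state: STEP of the block being emitted, or None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            found_box, step = int(header.group(1)), int(header.group(2))
            wanted = (found_box == box and step >= first_step
                      and (step - first_step) % stride == 0)
            current_step = step if wanted else None
            continue
        coord = COORD_RE.match(line)
        # count lines such as "5" or "240" match neither pattern and are ignored
        if current_step is not None and coord:
            yield current_step, [float(v) for v in coord.group(1).split()]

sample = io.StringIO("""220
BOX 1, STEP 1
C 15.1760586379 13.7666285127 4.1579861659
F 13.7752750995 13.3845518556 4.1992254467
5
BOX 2, STEP 1
C 15.1760586379 13.7666285127 4.1579861659
240
BOX 1, STEP 2
C 12.6851133069 2.8636250164 1.1788963097
""")

rows = list(extract_coordinates(sample))
for step, xyz in rows:
    print(step, *xyz)
```

On this sample only the two coordinate lines under BOX 1, STEP 1 are yielded. For the real data, passing an open file object instead of the StringIO should work the same way, and each row can be written to an output file rather than printed.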