Question

I am trying to use the re module to search a fairly large file for a string.

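The file alternates between box 1 and box 2 blocks, in the format of the sample below, for roughly 30,000 steps in total. I have some code that uses the re module to search this file for the keyword "STEP". Unfortunately, it produces no results when I run it. I need my code to 1) search ONLY box 1, then 2) print/output to a file all of the coordinates that follow STEP 1 (preferably omitting the "C's, F's, H's"; only the coordinates), and 3) increase the "STEP" number by 48, then repeat 2). I also want to ignore the "5" and "240" lines in the file; the code should compensate so that they are not included in the output. A condensed sample of the file: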

      220
      BOX 1,  STEP 1
      C        15.1760586379       13.7666285127        4.1579861659
      F        13.7752750995       13.3845518556        4.1992254467
      F        15.1122807811       15.0753387163        3.8457966464
      H        15.5298304628       13.5873563855        5.1615910859
      H        15.6594416869       13.1246597008        3.3754112615
        5
     BOX 2,  STEP 1
     C        15.1760586379       13.7666285127        4.1579861659
     F        13.7752750995       13.3845518556        4.1992254467
     F        15.1122807811       15.0753387163        3.8457966464
     H        15.5298304628       13.5873563855        5.1615910859
     H        15.6594416869       13.1246597008        3.3754112615
       240
     BOX 1,  STEP 2
     C        12.6851133069        2.8636250164        1.1788963097
     F        11.7935769268        1.7912366066        1.3042188034
     F        13.7887138736        2.3739304018        0.4126088380
     H        12.1153838312        3.7024696077        0.7164304431
     H        13.0962656950        3.1549047758        2.1436863477
     C        12.6745394723        3.6338848332       15.1374252921
     F        11.8703828307        4.3473226569       16.0480492173
     F        12.2304604843        2.3709059503       14.9433964493
     H        12.6002811971        4.1968554204       14.1449118786
     H        13.7469256153        3.6086212350       15.5204655285

Here is what I have so far (it does not work):

 import re
 shakes = open("mc_coordinates", "r")
 i = 1
 for line in shakes:
        if re.match("(.*)STEP i(.*)", line):
               print line
        i+=48

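Note that the sample above is condensed; there are normally about 250 lines of coordinates between "STEP" numbers. Any ideas or thoughts would be greatly appreciated. Thanks!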

Answer 1

A quick but effective approach is to parse the file line by line and keep a little state.

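# Untested sketch, but I think you get the idea.
import re

# Captures the box number and the step number from a header line.
header_re = re.compile(r"\s*BOX (\d+),\s+STEP (\d+)")
i = 1           # next STEP number to output
output = False  # are we in a block that should be output?
with open("mc_coordinates", "r") as shakes:
    for line in shakes:
        header = header_re.match(line)
        if header and header.group(1) == "1" and int(header.group(2)) == i:
            print(line.rstrip())
            output = True
            i += 48
        elif header:
            # some other box or step
            output = False
        elif output and len(line.split()) == 4:
            # skips count lines such as "5" or "240" and drops the C/F/H label
            print(" ".join(line.split()[1:]))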
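
One detail worth spelling out: in the pattern "(.*)STEP i(.*)" the i is a literal letter, not the loop variable, so the pattern can never match a line such as "BOX 1,  STEP 1". A minimal sketch of building the pattern from the current step number follows; step_pattern is a hypothetical helper name, not something from the question, and the example assumes Python 3.

```python
import re

def step_pattern(step, box=1):
    """Compile a pattern for the header line of one box/step pair (hypothetical helper)."""
    # \b after the number keeps STEP 1 from also matching STEP 10, STEP 11, ...
    return re.compile(r"\s*BOX {},\s+STEP {}\b".format(box, step))

header = "      BOX 1,  STEP 1"
# The original pattern looks for the literal text "STEP i", so this is None.
print(re.match("(.*)STEP i(.*)", header))
# With the number substituted in, the header line matches.
print(bool(step_pattern(1).match(header)))
# STEP 10 is rejected when looking for STEP 1.
print(bool(step_pattern(1).match("     BOX 1,  STEP 10")))
```

Compiling a new pattern for every step is fine for illustration; capturing the number with (\d+) once and comparing integers avoids the recompilation.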

Answer 2

The simplest way seems to be to use two regular-expression patterns: 1. Find the 'BOX 1, STEP 48N + 1' string. 2. Grab the coordinates.

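I've provided some code below. I haven't tried it on your data, but any bugs should be easy to fix. Basically, what you need is a small state machine that tells you when you should print coordinates.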

import io
import re

step_re = re.compile(r'BOX 1,\s+STEP (\d+)')
coord_re = re.compile(r'\s*(\d+\.\d+)' * 3)  # three decimal numbers; the dot is escaped
in_step = False
for line in io.open('your_file.txt', 'r'):
  if in_step:
    coord_match = coord_re.search(line)
    if coord_match:
      print(' '.join(coord_match.groups()))
    else:
      # a count line such as "5" or "240" ends the block
      in_step = False
    continue

  # search rather than match, because header lines start with whitespace
  step_match = step_re.search(line)
  if step_match and (int(step_match.group(1)) % 48) == 1:
    print('STEP {}'.format(step_match.group(1)))
    in_step = True
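
The same state machine can be packaged as a generator, which makes it easy to try on a few lines before running it on the full file. This is only a sketch under stated assumptions: select_coordinates, HEADER_RE and the box and stride parameters are names invented here, the code assumes Python 3, and it assumes every coordinate line has exactly four whitespace-separated fields (a label plus three numbers), as in the sample.

```python
import re

# Captures the box number and the step number from a header line.
HEADER_RE = re.compile(r"BOX (\d+),\s+STEP (\d+)")

def select_coordinates(lines, box=1, stride=48):
    """Yield (step, x, y, z) for steps 1, 1 + stride, 1 + 2 * stride, ... of one box."""
    current_step = None  # step number of the wanted block we are inside, else None
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            box_no, step = int(header.group(1)), int(header.group(2))
            wanted = box_no == box and (step - 1) % stride == 0
            current_step = step if wanted else None
            continue
        fields = line.split()
        # Coordinate lines have four fields; count lines ("5", "240") have one.
        if current_step is not None and len(fields) == 4:
            # Values stay as strings so the digits are written back unchanged.
            yield (current_step,) + tuple(fields[1:])

sample = """\
  220
  BOX 1,  STEP 1
  C        15.1760586379       13.7666285127        4.1579861659
  F        13.7752750995       13.3845518556        4.1992254467
    5
 BOX 2,  STEP 1
 C        15.1760586379       13.7666285127        4.1579861659
  240
 BOX 1,  STEP 2
 C        12.6851133069        2.8636250164        1.1788963097
"""

rows = list(select_coordinates(sample.splitlines()))
for step, x, y, z in rows:
    print(step, x, y, z)
```

For the real file, an open file object can be passed in place of sample.splitlines(), and each row can be written to an output file instead of printed.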

Python: using the re module to find a string, then print the values under that string

2 answers: