I want to capture the text that appears between two regex matches

Date: 2019-06-19 17:14:47

Tags: python regex

For example, this is my string (it is text extracted from HTML):

html_text = """
TABLE OF CONTENTS

PART I  
| ITEM 1. BUSINESS  
| ITEM 1A. RISK FACTORS  
| ITEM 1B. UNRESOLVED CONFLICTS  
| ITEM 2. PROPERTIES  
| ITEM 3. LEGAL PROCEEDINGS  

    We believe that relations with our employees are good; however, the competition
    for such personnel is intense, and the loss of key personnel could have a
    material adverse impact on our results of operations and financial condition.

    ITEM  1A. |  RISK FACTORS  

    Set forth below and elsewhere in this report and in other documents we file
    with the SEC are descriptions of the risks and uncertainties that could cause
    our actual results to differ materially from the results contemplated by the
    forward-looking statements contained in this report.

    ITEM 1B. UNRESOLVED CONFLICTS

    Our future revenue, gross margins, operating results and net income are
    difficult to predict and may materially"""

I wrote a regex to capture "ITEM 1A. RISK FACTORS" (the section heading, not the entry in the table of contents):

re.search(r"(ITEM.*1A)*.+(RISK FACTORS).*\n+(?!\w)(?!.*ITEM.*1B)", html_text)

and I need another regex to capture "ITEM 1B. UNRESOLVED CONFLICTS" (again the heading, not the entry in the table of contents):

re.search(still trying to figure this out)

I want to capture all the text that appears between these two matches. The final text string should look like this:

final_text = """    ITEM  1A. |  RISK FACTORS  

    Set forth below and elsewhere in this report and in other documents we file
    with the SEC are descriptions of the risks and uncertainties that could cause
    our actual results to differ materially from the results contemplated by the
    forward-looking statements contained in this report."""

2 answers:

Answer 0 (score: 0)

This might work for you:

re.compile(r"^(    ITEM  1A. \|  RISK FACTORS.+\n(?:\n.+)+)", re.MULTILINE)

You can see it on Regex101, but note that it works slightly differently there, because Regex101 does not use the re.compile(REGEXP, REGEXPOPTION) form.
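As a rough check of the pattern above, here is a minimal sketch that runs it against an abridged copy of the question's sample. The abridged `html_text` and the names `pattern` and `final_text` are assumptions made for illustration. Note that the `.+` after `RISK FACTORS` needs at least one character there, so the pattern only matches when the heading line has trailing whitespace, as it does in the sample.

```python
import re

# Abridged copy of the question's sample; the list form keeps the trailing
# spaces visible, since the pattern relies on them.
html_text = "\n".join([
    "TABLE OF CONTENTS",
    "",
    "PART I  ",
    "| ITEM 1A. RISK FACTORS  ",
    "| ITEM 1B. UNRESOLVED CONFLICTS  ",
    "",
    "    material adverse impact on our results of operations and financial condition.",
    "",
    "    ITEM  1A. |  RISK FACTORS  ",
    "",
    "    Set forth below and elsewhere in this report and in other documents we file",
    "    forward-looking statements contained in this report.",
    "",
    "    ITEM 1B. UNRESOLVED CONFLICTS",
    "",
    "    Our future revenue, gross margins, operating results and net income are",
])

# Pattern copied from the answer: the heading line, a blank line, then every
# following non-empty line. It stops at the next blank line.
pattern = re.compile(r"^(    ITEM  1A. \|  RISK FACTORS.+\n(?:\n.+)+)", re.MULTILINE)

match = pattern.search(html_text)
final_text = match.group(1) if match else None
print(final_text)
```

Because the repeated group requires non-empty lines, a section body that itself contains blank lines would be cut off at the first one.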

Answer 1 (score: 0)

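If you want to match from ITEM 1A. RISK FACTORS up to the next ITEM 1B. UNRESOLVED CONFLICTS, and then match each following ITEM separately, you could use a repeating pattern with a negative lookahead to match all following lines that do not contain the pattern ITEM, whitespace, digits and an uppercase character.

^[ \t]+ITEM[ \t]+\d+[A-Z]\.[ \t]+.*(?:\r?\n(?!.*\bITEM[ \t]+\d+[A-Z]).*)*

Explanation

  • ^ Start of the string (start of each line when re.MULTILINE is used)
  • [ \t]+ITEM Match 1+ spaces or tabs followed by ITEM
  • [ \t]+\d+[A-Z]\. Match 1+ spaces or tabs followed by 1+ digits, an uppercase character and a dot
  • [ \t]+.* Match 1+ spaces or tabs followed by any characters except a newline
  • (?: Non-capturing group
    • \r?\n Match a newline
    • (?! Negative lookahead, assert that what is on the right is not
      • .*\bITEM[ \t]+\d+[A-Z] Match ITEM followed by tabs or spaces, 1+ digits and an uppercase character
    • ) Close the negative lookahead
    • .* Match any character except a newline, 0+ times
  • )* Close the non-capturing group and repeat it 0+ times

See the regex demo | Python demo

For example:

regex = r"^[ \t]*ITEM[ \t]*\d+[A-Z]\.[ \t]+.*(?:\r?\n(?!.*ITEM \d+[A-Z]).*)*"
print(re.findall(regex, html_text, re.MULTILINE))

Note: if you only use spaces, you can shorten the [ \t]+ parts to a space followed by a plus sign:

^ +ITEM +\d+[A-Z]\. +.*(?:\r?\n(?!.*\bITEM +\d+[A-Z]).*)*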
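The question literally asks for the text between two matches, so an alternative worth sketching is to locate each heading with its own pattern and slice the string between them. This is only a sketch under assumptions: the abridged `html_text` and the names `start_pat`, `end_pat` and `final_text` are hypothetical, and it assumes body headings begin a line with optional indentation while table-of-contents entries begin with `|`.

```python
import re

# Abridged copy of the question's sample text.
html_text = "\n".join([
    "TABLE OF CONTENTS",
    "",
    "PART I  ",
    "| ITEM 1A. RISK FACTORS  ",
    "| ITEM 1B. UNRESOLVED CONFLICTS  ",
    "",
    "    material adverse impact on our results of operations and financial condition.",
    "",
    "    ITEM  1A. |  RISK FACTORS  ",
    "",
    "    Set forth below and elsewhere in this report and in other documents we file",
    "    forward-looking statements contained in this report.",
    "",
    "    ITEM 1B. UNRESOLVED CONFLICTS",
    "",
    "    Our future revenue, gross margins, operating results and net income are",
])

# Anchoring at the line start and allowing only spaces or tabs before ITEM
# skips the table-of-contents entries, which start with "|".
start_pat = re.compile(r"^[ \t]*ITEM[ \t]+1A\..*RISK FACTORS", re.MULTILINE)
end_pat = re.compile(r"^[ \t]*ITEM[ \t]+1B\..*UNRESOLVED CONFLICTS", re.MULTILINE)

start = start_pat.search(html_text)
# Look for the closing heading only after the opening one.
end = end_pat.search(html_text, start.end()) if start else None

final_text = None
if start and end:
    # Keep everything from the first heading up to the second one,
    # dropping the blank lines left at the end of the slice.
    final_text = html_text[start.start():end.start()].rstrip()

print(final_text)
```

Slicing keeps blank lines inside the section, at the cost of hard-coding both headings; the repeating pattern above is more general when every ITEM section is wanted.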
