In Python

Asked: 2017-02-15 14:46:45

Tags: python regex findall

I have the following two snippets in Python (short_sentence is part of long_sentence):

short_sentence = '&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;'

long_sentence = '<description>&lt;img src=&quot;http://cdn.static-economist.com/sites/default/files/images/print-edition/20170211_LDC811.png&quot; alt=&quot;&quot; title=&quot;&quot; height=&quot;376&quot; width=&quot;458&quot; class=&quot; blog-post-article-image blog-post-article-image__slim&quot; data-reactid=&quot;388&quot;/&gt;&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;&lt;p data-reactid=&quot;390&quot;&gt;To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.&lt;/p&gt;'

I want to extract every (shortest) substring that sits between &lt;p + anything + &gt; and &lt;/p&gt;. I know there is one such occurrence in short_sentence:

THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.

In long_sentence there is the one above plus another:

To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.

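I know that Python's re.findall() returns every non-overlapping match of the pattern (or, when the pattern has a single capture group, the text of that group). When I try the following: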

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", short_sentence)

I get the result I expected:

['THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.']

However, when I try to extract both substrings from long_sentence with:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", long_sentence)

I still get only one occurrence (the second one):

['To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.']

My question is: what goes wrong in the second case? Why aren't both substrings returned?

1 answer:

Answer 0: (score: 0)

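The .* in &lt;p.*&gt; is greedy, so it consumes as much as it can: starting at the first &lt;p, it runs on to the last &gt; that still lets the rest of the pattern match, which is the &gt; closing the second opening tag. The first paragraph is swallowed by .* and only the second one is captured. If you use the lazy form &lt;p.*?&gt;, you get the expected result.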
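A minimal sketch of the difference, using a shortened stand-in for long_sentence (the name sample and its paragraph text are illustrative and do not come from the question):

```python
import re

# Hypothetical, shortened stand-in for long_sentence: two escaped <p> elements.
sample = (
    '&lt;p data-reactid=&quot;389&quot;&gt;First paragraph.&lt;/p&gt;'
    '&lt;p data-reactid=&quot;390&quot;&gt;Second paragraph.&lt;/p&gt;'
)

# Greedy .* after "&lt;p" extends to the last "&gt;" that still lets the rest
# of the pattern match, so the single match swallows the first paragraph.
greedy = re.findall(r"&lt;p.*&gt;(.*?)&lt;/p&gt;", sample)

# Lazy .*? stops at the first "&gt;", so each paragraph is matched on its own.
lazy = re.findall(r"&lt;p.*?&gt;(.*?)&lt;/p&gt;", sample)

print(greedy)  # ['Second paragraph.']
print(lazy)    # ['First paragraph.', 'Second paragraph.']
```

The only change between the two patterns is the ? after the first .*, which is the fix suggested above.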

If you need it, more information on the topic is available here: http://www.regular-expressions.info/repeat.html

Excerpt:

  

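Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.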

     

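Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and, when continuing after that match, </EM>.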
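The behaviour the excerpt describes can be checked in Python with a short sketch. The test string and the variable names below are assumptions made for illustration:

```python
import re

# Illustrative test string containing one pair of HTML tags.
text = "This is a <EM>first</EM> test"

# Greedy .+ runs to the last ">" in the string, so both tags and the
# text between them come back as a single match.
greedy_tags = re.findall(r"<.+>", text)

# Lazy .+? stops at the first ">", so each tag is matched separately.
lazy_tags = re.findall(r"<.+?>", text)

print(greedy_tags)  # ['<EM>first</EM>']
print(lazy_tags)    # ['<EM>', '</EM>']
```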