Question

I am trying to extract text from HTML code. Here is my code:

import re
Luna = open('D:\Python\Luna.txt','r+')
text=Luna.read()
txt=re.findall('<p>\s+(.*)</p>',text)
print txt

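However, it only removes the part before the first <p>; everything after the first <p> is still returned. How can I improve my code so that it returns only the parts between <p> and </p>? Here is part of the original HTML: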

src="/advjs/gg728x90.js"></script></td>  </tr></table><div class="text" align="justify"></p><p> Sure. Eye of newt. Tongue of snake.</p><p>  She added, &ldquo;Since you&rsquo;re taking Skills for Living, it&rsquo;ll be good practice.&rdquo;</p><p>  For what? I wondered. Poisoning my family? &ldquo;I have to baby-sit,&rdquo; I said, a little too gleefully.</p>

Answer 1

I strongly recommend using a proper HTML parser, such as BeautifulSoup:

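from bs4 import BeautifulSoup

# Luna is the file object opened in the question; naming the parser
# explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(Luna.read(), 'html.parser')
para_strings = (p.get_text() for p in soup.find_all('p'))
# Keep only paragraphs that start with a space, mirroring the \s+ in the regex.
txt = [p.strip() for p in para_strings if p.startswith(' ')]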

You could also fix your regular expression by using a non-greedy operator (append a ? to the * operator):

txt=re.findall('<p>\s+(.*?)</p>',text)
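
To see why the extra ? matters: a greedy .* consumes as much as it can and then backtracks only as far as the last </p> in the text, so every paragraph ends up in one match. A non-greedy .*? stops at the first </p> it reaches. Here is a minimal sketch comparing the two, using a shortened version of the HTML from the question (the variable names html, greedy and lazy are just for illustration):

```python
import re

# Shortened sample based on the HTML in the question.
html = ('<div class="text" align="justify"></p>'
        '<p> Sure. Eye of newt. Tongue of snake.</p>'
        '<p>  For what? I wondered.</p>')

# Greedy: .* runs to the LAST </p>, so both paragraphs land in one match.
greedy = re.findall(r'<p>\s+(.*)</p>', html)

# Non-greedy: .*? stops at the FIRST </p> after each <p>.
lazy = re.findall(r'<p>\s+(.*?)</p>', html)

print(greedy)
print(lazy)
```

The greedy pattern returns a single string that still contains `</p><p>`, while the non-greedy pattern returns one string per paragraph.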

However, you will most likely run into other problems parsing with regular expressions, because HTML is not a regular language.
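
As an illustration of the kind of problem you can hit, the sketch below feeds the same non-greedy regex some made-up HTML in which one <p> carries an attribute and spans two lines. If installing BeautifulSoup is not an option, the standard library's html.parser module can do the job; ParagraphCollector is a hypothetical helper written for this example, not part of any library, and it assumes <p> elements are not nested:

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._buffer = []

    def handle_endtag(self, tag):
        # The guard ignores stray </p> tags like the one in the question's HTML.
        if tag == 'p' and self._buffer is not None:
            self.paragraphs.append(''.join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Made-up sample: the first <p> has an attribute and contains a newline.
html = '<p class="intro"> First\nparagraph.</p><p> Second &ldquo;one&rdquo;.</p>'

regex_result = re.findall(r'<p>\s+(.*?)</p>', html)

collector = ParagraphCollector()
collector.feed(html)
collector.close()

print(regex_result)
print(collector.paragraphs)
```

The regex misses the first paragraph entirely, because `<p class="intro">` is not literally `<p>`, and it leaves entities such as `&ldquo;` undecoded. The parser-based version returns both paragraphs with the entities converted to the actual quote characters.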
