Question

How can I use a regular expression to extract the answer "Here is the answer" from an HTML page like this?

  <b>Last Question:</b>
  <b>Here is the answer</b>

Answer 1

I know regular expressions aren't recommended for parsing HTML, but to answer your question: if you're using PHP, Simple HTML DOM is your friend. http://simplehtmldom.sourceforge.net/

Answer 2

Thanks, everyone!

Here is my solution using BeautifulSoup, since I'm working in a Python framework:

  response = opener.open(url)
  the_page = response.read()

  soup = BeautifulSoup(''.join(the_page))
  paraText1 = soup.body.find('div', 'div_id', text = u'Last Question:')

  if paraText1:
    answer = paraText1.next
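
For anyone without BeautifulSoup installed, roughly the same "find the label, then take what follows" idea can be sketched with Python's standard-library html.parser. The names AnswerAfterLabel and extract_answer are made up for illustration, and the sample markup is the snippet from the question, not the asker's real page:

```python
from html.parser import HTMLParser


class AnswerAfterLabel(HTMLParser):
    """Collects the first non-blank text node that follows a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.seen_label = False
        self.answer = None

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only nodes and anything after the answer is found.
        if not text or self.answer is not None:
            return
        if self.seen_label:
            self.answer = text
        elif text == self.label:
            self.seen_label = True


def extract_answer(html, label="Last Question:"):
    parser = AnswerAfterLabel(label)
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.answer


page = "<b>Last Question:</b>\n<b>Here is the answer</b>"
print(extract_answer(page))  # Here is the answer
```

extract_answer returns None when the label never appears, so the caller can tell "not found" apart from an empty answer.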

Answer 3

Don't use regexes to parse HTML. That goes double if what you have isn't well-formed SGML/XML/HTML5 but tag soup.
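
To make the tag-soup point concrete, here is a small Python sketch. The messy markup below is invented for illustration (upper-case tags, an attribute, a nested tag, a missing closing tag), and BoldText is a hypothetical helper built on the standard-library html.parser. It does not handle `<b>` nested inside `<b>`.

```python
import re
from html.parser import HTMLParser

# Messy but plausible markup: upper-case tags, an attribute,
# a nested tag, and no closing </b> at the end.
soup = '<B class="q">Last Question:</B>\n<b><i>Here is</i> the answer'

# A pattern written for the tidy sample no longer matches at all.
strict = re.compile(r"<b>Last Question:</b>\s*<b>(.*?)</b>")
print(strict.search(soup))  # None


class BoldText(HTMLParser):
    """Gathers the text inside each <b> element, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":  # the parser lower-cases tag names for us
            self.depth += 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == "b" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data


parser = BoldText()
parser.feed(soup)
parser.close()  # flush trailing text that has no closing tag
print(parser.chunks[1])  # Here is the answer
```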

Answer 4

Don't use regex. Use an HTML parser like Jsoup.

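import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<b>Last Question:</b><b>Here is the answer</b>";
Document document = Jsoup.parse(html);
Element secondBold = document.select("b").get(1); // index 1 = second <b>
System.out.println(secondBold.text()); // Here is the answer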

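Jsoup is Java-based. HTML parsers are available for other programming languages too. If you're using C#, take a look at Nsoup. If you're using PHP, take a look at phpQuery. (All of these parsers select elements with jQuery-like CSS3 selectors, which is simply brilliant.)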

Answer 5

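As Charles said, don't use regular expressions. If you're using PHP, I'd recommend the built-in DOM parsing functionality; combined with XPath, it has proven very reliable.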
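
The answer is about PHP, but the DOM-plus-XPath idea looks much the same elsewhere. As a rough Python analogue (my assumption, not the answerer's code), the standard-library xml.etree.ElementTree supports a limited XPath subset. It only accepts well-formed markup, so this works on the tidy sample from the question but would raise a parse error on real-world tag soup:

```python
import xml.etree.ElementTree as ET

fragment = "<b>Last Question:</b>\n<b>Here is the answer</b>"

# Wrap the fragment so there is a single root element to parse.
root = ET.fromstring("<root>" + fragment + "</root>")

# "b[2]" selects the second <b> child of the root (XPath positions start at 1).
second_bold = root.find("b[2]")
print(second_bold.text)  # Here is the answer
```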

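If you're more open-minded than I am, I'd suggest doing the job with jQuery via Node.js. I've been doing a lot of that myself lately, and it makes life easy.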

Answer 6

<b>Last Question:</b>\s*(<b>.*?</b>)

Or, more verbosely,

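import java.util.regex.Matcher;
import java.util.regex.Pattern;

String x = "<b>Last Question:</b>\n<b>Here is the answer</b>";
Pattern p = Pattern.compile("<b>Last Question:</b>\\s*(<b>.*?</b>)");
Matcher m = p.matcher(x);
if (m.find())
   System.out.println(m.group(1)); // <b>Here is the answer</b> (group 1 includes the tags)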
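
For comparison, a minimal Python sketch of the same approach. One deliberate difference from the Java version, which is my choice rather than the answerer's: the capture group wraps only the text between the tags, so the result is the bare answer the question asks for.

```python
import re

page = "<b>Last Question:</b>\n<b>Here is the answer</b>"

# Same idea as the pattern above, but the group captures only the inner text.
# re.DOTALL lets the answer itself span several lines.
pattern = re.compile(r"<b>Last Question:</b>\s*<b>(.*?)</b>", re.DOTALL)

match = pattern.search(page)
if match:
    print(match.group(1))  # Here is the answer
```

This still breaks as soon as the tags gain attributes or change case.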

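Regular expressions are still an option when HTML or similar tags are absent, or when they appear at random without giving enough context. In that case, we have to anchor on some of the words in the human-language text instead.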
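
As a small illustration of anchoring on words rather than tags, here is a Python sketch. The input text and the "Answer:" label are hypothetical, since the thread gives no example of such a page:

```python
import re

# Hypothetical plain-text input: no tags at all, only wording to anchor on.
text = "Last Question: What is the capital of France?\nAnswer: Paris\n"

# Anchor on the word "Answer" at the start of a line and capture the rest.
pattern = re.compile(r"^Answer:\s*(.+)$", re.MULTILINE)

match = pattern.search(text)
if match:
    print(match.group(1))  # Paris
```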

How do I use a regular expression to extract information from an HTML page?
