How can I extract the text between comment tags using Beautiful Soup?

Asked: 2015-06-04 14:24:41

Tags: python html parsing beautifulsoup

I have the following HTML code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><!-- InstanceBegin template="/Templates/BandDetails.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>&lt;BLR&gt;</title>
<!-- InstanceEndEditable -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<!-- InstanceBeginEditable name="head" --><!-- InstanceEndEditable -->
</head>

<body>
<div align="center">
  <table width="0" border="0" cellpadding="0" cellspacing="0" id="mainTable">
    <tr>
      <td colspan="2" id="navbar"><!--#include file="menu.htm" --></td>
    </tr>
    <tr>
      <td id="maincontent"><table width="0" border="0" cellpadding="0" cellspacing="0" id="contentInner">
        <tr>
          <td class="bodytext">
            <p></p><!-- InstanceBeginEditable name="bigPicture-378wide" --><img src="images/BLRlarge.jpg" alt="BLR" width="378" height="324" class="PictureFloatRight"><!-- InstanceEndEditable -->          
            <!-- InstanceBeginEditable name="DAYdateMonthYear" -->
            <p>Thursday 11th March 2010 </p>
            <!-- InstanceEndEditable -->

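How can I use Beautiful Soup to extract only the text enclosed between the comment tags? For example, I would like to get back:

<BLR>

Thursday 11th March 2010

Thanks

2 answers:

Answer 0 (score: 1)

You may find this program useful: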

from bs4 import BeautifulSoup
from bs4.element import Comment
html_doc = 'x.html'
# Open in binary mode so Beautiful Soup can detect the encoding itself
with open(html_doc, 'rb') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Identify the start comment
def isInstanceBeginEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("InstanceBeginEditable"))

# Identify the end comment
def isInstanceEndEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("InstanceEndEditable"))

# Look for start comments
for instanceBeginEditable in soup.find_all(text=isInstanceBeginEditable):
    # We found a start comment, look at all text and comments:
    for text in instanceBeginEditable.find_all_next(text=True):
        # We found a text or comment, examine it closely
        if isInstanceEndEditable(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)
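
For comparison, here is a minimal single-pass sketch of the same idea. It walks the parsed tree once and keeps a flag that records whether the current position lies between an InstanceBeginEditable and an InstanceEndEditable comment. The names `SAMPLE_HTML` and `extract_editable_text` are hypothetical, and the sample markup is a trimmed-down stand-in for the page in the question, not the full file.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Assumption: a shortened version of the question's HTML, for illustration only.
SAMPLE_HTML = """
<html><head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>&lt;BLR&gt;</title>
<!-- InstanceEndEditable -->
</head>
<body>
<p>Outside any editable region</p>
<!-- InstanceBeginEditable name="DAYdateMonthYear" -->
<p>Thursday 11th March 2010 </p>
<!-- InstanceEndEditable -->
</body></html>
"""


def extract_editable_text(html):
    """Return the stripped text found between the begin/end marker comments."""
    soup = BeautifulSoup(html, "html.parser")
    inside = False  # True while we are between a begin and an end comment
    results = []
    for node in soup.descendants:
        if not isinstance(node, NavigableString):
            continue  # skip tags; only strings and comments matter here
        if isinstance(node, Comment):
            marker = node.strip()
            if marker.startswith("InstanceBeginEditable"):
                inside = True
            elif marker.startswith("InstanceEndEditable"):
                inside = False
            continue  # comments are never part of the output
        if inside and node.strip():
            results.append(node.strip())
    return results


for line in extract_editable_text(SAMPLE_HTML):
    print(line)
```

Because the flag is flipped as the comments are encountered, each text node is visited only once, whereas the nested loops above rescan the document tail for every start comment. Text outside the markers, such as the first paragraph in the sample, is skipped.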

Answer 1 (score: 0)

import bs4
# html_text is assumed to already hold the page source as a string
soup = bs4.BeautifulSoup(html_text, 'html.parser')
print(soup.get_text().replace('\n', ''))
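
To see what this approach yields, here is a small self-contained sketch. The `html_text` value below is a hypothetical, shortened sample rather than the full page from the question. As far as I can tell, `get_text()` gathers the text of the whole document, so text outside the marker comments is included as well, and the comments themselves are left out by default.

```python
import bs4

# Assumption: a shortened sample standing in for the page source.
html_text = """<html><body>
<p>Outside any editable region</p>
<!-- InstanceBeginEditable name="DAYdateMonthYear" -->
<p>Thursday 11th March 2010 </p>
<!-- InstanceEndEditable -->
</body></html>"""

soup = bs4.BeautifulSoup(html_text, "html.parser")
# Collect all document text, then drop the newlines between elements
flat_text = soup.get_text().replace("\n", "")
print(flat_text)
```

If only the text between the comments is wanted, the result would still need to be filtered using the comment markers.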