I have the following HTML code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><!-- InstanceBegin template="/Templates/BandDetails.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<!-- InstanceBeginEditable name="doctitle" -->
<title><BLR></title>
<!-- InstanceEndEditable -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<!-- InstanceBeginEditable name="head" --><!-- InstanceEndEditable -->
</head>
<body>
<div align="center">
<table width="0" border="0" cellpadding="0" cellspacing="0" id="mainTable">
<tr>
<td colspan="2" id="navbar"><!--#include file="menu.htm" --></td>
</tr>
<tr>
<td id="maincontent"><table width="0" border="0" cellpadding="0" cellspacing="0" id="contentInner">
<tr>
<td class="bodytext">
<p></p><!-- InstanceBeginEditable name="bigPicture-378wide" --><img src="images/BLRlarge.jpg" alt="BLR" width="378" height="324" class="PictureFloatRight"><!-- InstanceEndEditable -->
<!-- InstanceBeginEditable name="DAYdateMonthYear" -->
<p>Thursday 11th March 2010 </p>
<!-- InstanceEndEditable -->
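How can I use Beautiful Soup to extract only the text enclosed between the InstanceBeginEditable and InstanceEndEditable comment tags? For example, I want to return:
<BLR>
Thursday 11th March 2010
Thanks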
Answer 0 (score: 1)
You may find this program useful:
from bs4 import BeautifulSoup
from bs4.element import Comment

html_doc = 'x.html'
with open(html_doc) as f:
    # Name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(f, 'html.parser')

# Identify the start comment
def isInstanceBeginEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("InstanceBeginEditable"))

# Identify the end comment
def isInstanceEndEditable(text):
    return (isinstance(text, Comment) and
            text.strip().startswith("InstanceEndEditable"))

# Look for start comments (string= is the current name of the older text= argument)
for instanceBeginEditable in soup.find_all(string=isInstanceBeginEditable):
    # We found a start comment, look at all text and comments:
    for text in instanceBeginEditable.find_all_next(string=True):
        # We found a text or comment, examine it closely
        if isInstanceEndEditable(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)
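If you would rather get the results back than print them, the same walk can be wrapped in a function. This is only a sketch: `SAMPLE_HTML`, `NAME_RE` and `editable_regions` are names made up for illustration, the sample markup is a trimmed stand-in for the page in the question, and it assumes every start comment carries a `name="..."` attribute.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample input, modeled loosely on the question's markup.
SAMPLE_HTML = """
<html><head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>BLR</title>
<!-- InstanceEndEditable -->
</head><body>
<!-- InstanceBeginEditable name="DAYdateMonthYear" -->
<p>Thursday 11th March 2010 </p>
<!-- InstanceEndEditable -->
<p>Outside any editable region</p>
</body></html>
"""

# Assumption: the region name always appears as name="..." in the start comment.
NAME_RE = re.compile(r'InstanceBeginEditable\s+name="([^"]*)"')


def editable_regions(html):
    """Return {region name: text} for each InstanceBeginEditable block."""
    soup = BeautifulSoup(html, "html.parser")
    regions = {}
    # Visit every comment node in document order.
    for node in soup.find_all(string=lambda s: isinstance(s, Comment)):
        match = NAME_RE.search(node)
        if match is None:
            continue  # not a start comment
        parts = []
        # Walk forward through all strings until the matching end comment.
        for item in node.find_all_next(string=True):
            if isinstance(item, Comment):
                if item.strip().startswith("InstanceEndEditable"):
                    break
                continue  # skip any other comment
            if item.strip():
                parts.append(item.strip())
        regions[match.group(1)] = " ".join(parts)
    return regions


if __name__ == "__main__":
    for name, text in editable_regions(SAMPLE_HTML).items():
        print(name, "->", text)
```

Keying the result by region name makes it easy to pick out just `doctitle` or `DAYdateMonthYear` afterwards. Text outside the comment pairs is never collected.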
Answer 1 (score: 0)
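import bs4
# html_text is assumed to hold the page source as a string
soup = bs4.BeautifulSoup(html_text, 'html.parser')
# get_text() returns all text in the document; comment nodes are skipped
print(soup.get_text().replace('\n', ''))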
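One thing to keep in mind with this approach is that `get_text()` works on the whole document, so text outside the editable regions comes back too. A small sketch on a made-up snippet (the `HTML` string and the `Menu` paragraph are invented for illustration, not taken from the question's page):

```python
from bs4 import BeautifulSoup

# Hypothetical input: one paragraph ("Menu") sits outside the editable regions.
HTML = (
    '<html><head><!-- InstanceBeginEditable name="doctitle" -->'
    '<title>BLR</title><!-- InstanceEndEditable --></head>'
    '<body><p>Menu</p>'
    '<!-- InstanceBeginEditable name="DAYdateMonthYear" -->'
    '<p>Thursday 11th March 2010</p><!-- InstanceEndEditable -->'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# separator/strip make the individual text nodes easy to see.
all_text = soup.get_text(separator="|", strip=True)
print(all_text)
```

Here the `Menu` paragraph shows up alongside the two wanted values, and the comments themselves are dropped. So this answer is fine for a page-wide text dump, but it does not by itself limit the output to the commented regions.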