How do I print only the text enclosed in three tags?

Date: 2016-03-12 23:37:24

Tags: python html web-scraping beautifulsoup

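I am scraping a website and I am stuck on one part. I am trying to print only the text that is enclosed in three HTML tags.

Here is a small sample of what I am scraping.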

<h3>This is a header</h3>
<b>NOTE:</b> Important note that I don't need!<br><br>
<TABLE  width="100%" cellpadding="2">
<TR>
<TD COLSPAN="18" class = "subject_header">Also another thing that I don't need</TD>
</TR>
<TR>
<TD COLSPAN="18" class="Data"><br><font size=2 ><b>***THIS IS THE TEXT THAT I REALLY NEED!!!*** </b></font><BR> <p><b>Note: </b><i> And more text that I don't need </i></p> Some other text that I don't care about</TD>
</TR>
<TR>
<TD COLSPAN="5"><b><font color="red">And more stuff I don't need</font></b></TD>
<TD COLSPAN="18" class="Data"><br><font size=2 >Text that I don't need. </TD>
</TR>

The text I actually need to extract is:

<TD COLSPAN="18" class="Data"><br><font size=2 ><b>THIS IS THE TEXT THAT I REALLY NEED!!! </b></font>

I have tried many things, but everything I tried returns all of the text, not just that part.

---------- ---------- EDIT

I forgot to mention that the HTML file contains many rows with the same class, so if I try to use

soup.find_all("td", {"class":"Data"})

it doesn't work.

I also updated the HTML code to show an example of what I mean. Note that the one I want has a bold tag (I'm pretty sure that will help).

3 answers:

Answer 0 (score: 2)

Without seeing your code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html_object, "html.parser")
td = soup.find('td', {'class': "Data"})
print(td.b.text)

Result:

***THIS IS THE TEXT THAT I REALLY NEED!!!***
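Since the question mentions many rows sharing the `Data` class, `find` alone only inspects the first matching cell, and `td.b` is `None` when that cell has no bold tag. Below is a minimal sketch of one way to handle that, assuming a trimmed-down version of the question's markup (the first row here is made up for illustration): loop over every `td.Data` cell and keep only those that contain a `<b>`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: the first Data cell has no <b> tag,
# the second one holds the wanted text.
html = """
<table>
<tr><td class="Data"><font size="2">Plain cell without bold text</font></td></tr>
<tr><td class="Data"><br><font size="2"><b>***THIS IS THE TEXT THAT I REALLY NEED!!!*** </b></font></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

wanted = []
for td in soup.find_all("td", class_="Data"):
    bold = td.find("b")  # first <b> descendant of this cell, or None
    if bold is not None:
        wanted.append(bold.get_text(strip=True))

print(wanted)
```

Checking for `None` avoids the `AttributeError` that `td.b.text` would raise on a cell without a bold tag.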

Answer 1 (score: 1)

from bs4 import BeautifulSoup

html ='''<h3>This is a header</h3>
<b>NOTE:</b> Important note that I don't need!<br><br>
<TABLE  width="100%" cellpadding="2">
<TR>
<TD COLSPAN="18" class = "subject_header">Also another thing that I don't need</TD>
</TR>
<TR>
<TD COLSPAN="18" class="Data"><br><font size=2 ><b>***THIS IS THE TEXT THAT I REALLY NEED!!!*** </b></font><BR> <p><b>Note: </b><i> And more text that I don't need </i></p> Some other text that I don't care about</TD>
</TR>
<TR>
<TD COLSPAN="5"><b><font color="red">And more stuff I don't need</font></b></TD>
</TR>'''

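# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")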
text = soup.find('td', class_='Data').b.text
print(text)

Output:

***THIS IS THE TEXT THAT I REALLY NEED!!!***

Answer 2 (score: 0)

There is also a more concise option: use a CSS selector

soup.select_one("td.Data b").text

If multiple elements match the locator and you want to get all of them:

[elm.text for elm in soup.select("td.Data b")]
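One caveat with the sample markup from the question: the wanted cell also contains `<p><b>Note: </b>`, so the descendant selector `td.Data b` matches that bold tag too. The sketch below, which assumes a shortened copy of the question's HTML and the `html.parser` backend, compares it with a stricter child selector, `td.Data > font > b`, that only matches a `<b>` sitting directly inside the cell's `<font>`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup: two cells share the Data class.
html = """
<table>
<tr>
<td colspan="18" class="Data"><br><font size=2><b>***THIS IS THE TEXT THAT I REALLY NEED!!!*** </b></font><br> <p><b>Note: </b><i> And more text that I don't need </i></p> Some other text</td>
</tr>
<tr>
<td colspan="18" class="Data"><br><font size=2>Text that I don't need. </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: any <b> anywhere inside a td.Data cell.
loose = [b.get_text(strip=True) for b in soup.select("td.Data b")]

# Child selector: only a <b> directly inside a <font> directly inside td.Data.
strict = [b.get_text(strip=True) for b in soup.select("td.Data > font > b")]

print(loose)
print(strict)
```

The stricter selector relies on the `td > font > b` nesting shown in the question, so it would need adjusting if other pages nest the tags differently.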