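I'm scraping a website and I'm stuck on one part. I'm trying to print only the text that is nested inside three HTML tags.
Here is a small sample of what I'm scraping.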
<h3>This is a header</h3>
<b>NOTE:</b> Important note that I don't need!<br><br>
<TABLE width="100%" cellpadding="2">
<TR>
<TD COLSPAN="18" class = "subject_header">Also another thing that I don't need</TD>
</TR>
<TR>
<TD COLSPAN="18" class="Data"><br><font size=2 ><b>***THIS IS THE TEXT THAT I REALLY NEED!!!*** </b></font><BR> <p><b>Note: </b><i> And more text that I don't need </i></p> Some other text that I don't care about</TD>
</TR>
<TR>
<TD COLSPAN="5"><b><font color="red">And more stuff I don't need</font></b></TD>
<TD COLSPAN="18" class="Data"><br><font size=2 >Text that I don't need. </TD>
</TR>
The text I actually need to extract is:
<TD COLSPAN="18" class="Data"><br><font size=2 ><b>THIS IS THE TEXT THAT I REALLY NEED!!! </b></font>
I've tried many things, but everything I tried returns all of the text instead of just that part.
---------- ---------- EDIT
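I forgot to mention that the HTML file contains many rows with the same class, so if I try to use
soup.find_all("td", {"class":"Data"})
it doesn't work.
I also updated the HTML above to show an example of what I mean. Note that the cell I want has a bold tag (I'm pretty sure this will help).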
Answer 0 (score: 2)
Without seeing your code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_object, "html.parser")
td = soup.find('td', {'class': "Data"})
print(td.b.text)
Result:
***THIS IS THE TEXT THAT I REALLY NEED!!!***
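Because the question's edit says many cells share the `Data` class, `find` will only ever return the first of them. The following is a minimal sketch of one way to filter them, assuming the wanted cells are those whose `<font>` element directly contains a `<b>`. The trimmed-down HTML and the names `html` and `wanted` are illustrative and not taken from the real page.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modeled on the question's HTML (illustrative only).
html = """
<table>
<tr><td class="subject_header">Header I don't need</td></tr>
<tr><td class="Data"><br><font size=2><b>***TEXT I NEED*** </b></font><br>
<p><b>Note: </b><i>not needed</i></p> other text</td></tr>
<tr><td class="Data"><br><font size=2>Text that I don't need. </font></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

wanted = []
for td in soup.find_all("td", class_="Data"):
    font = td.find("font")
    # Assumption: the wanted text is a <b> that is a direct child of <font>.
    bold = font.find("b", recursive=False) if font else None
    if bold is not None:
        wanted.append(bold.get_text(strip=True))

print(wanted)
```

Cells with the `Data` class but no bold text inside `<font>` are skipped, and the `<b>Note: </b>` inside the `<p>` is ignored because it is not a child of `<font>`.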
Answer 1 (score: 1)
from bs4 import BeautifulSoup
html ='''<h3>This is a header</h3>
<b>NOTE:</b> Important note that I don't need!<br><br>
<TABLE width="100%" cellpadding="2">
<TR>
<TD COLSPAN="18" class = "subject_header">Also another thing that I don't need</TD>
</TR>
<TR>
<TD COLSPAN="18" class="Data"><br><font size=2 ><b>***THIS IS THE TEXT THAT I REALLY NEED!!!*** </b></font><BR> <p><b>Note: </b><i> And more text that I don't need </i></p> Some other text that I don't care about</TD>
</TR>
<TR>
<TD COLSPAN="5"><b><font color="red">And more stuff I don't need</font></b></TD>
</TR>'''
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(html, "html.parser")
text = soup.find('td', class_='Data').b.text
print(text)
Output
***THIS IS THE TEXT THAT I REALLY NEED!!!***
Answer 2 (score: 0)
A more concise option is to use a CSS selector:
soup.select_one("td.Data b").text
If multiple elements match the selector and you want all of them:
[elm.text for elm in soup.select("td.Data b")]
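One caveat: the descendant selector `td.Data b` also matches the `<b>Note: </b>` nested inside the `<p>` of the same cell. A possible refinement, sketched below on an illustrative trimmed-down sample, is to use child combinators (`>`) so that only a `<b>` sitting directly inside `<font>` is selected. This assumes the real page keeps the same `td > font > b` nesting shown in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modeled on the question's HTML (illustrative only).
html = """
<table>
<tr><td class="Data"><br><font size=2><b>***TEXT I NEED*** </b></font><br>
<p><b>Note: </b><i>not needed</i></p> other text</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: every <b> anywhere inside a td.Data cell.
all_bold = [b.get_text(strip=True) for b in soup.select("td.Data b")]

# Child combinators: only a <b> directly inside a <font> directly inside td.Data.
direct_bold = [b.get_text(strip=True) for b in soup.select("td.Data > font > b")]

print(all_bold)
print(direct_bold)
```

Here `all_bold` contains both the wanted text and `Note:`, while `direct_bold` contains only the wanted text.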