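I have a table from which I am extracting links and text. However, I can only get one or the other. Any idea how to get both?
Basically, I need to extract the text: "TEXT TO EXTRACT HERE"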
for tr in rows:
    cols = tr.findAll('td')
    count = len(cols)
    if len(cols) > 1:
        third_column = tr.findAll('td')[2].contents
        third_column_text = str(third_column)
        third_columnSoup = BeautifulSoup(third_column_text)
        # Issue starts here. How can I get either the text of the element <td>text here</td> or the link text <a href="somewhere.html">text here</a>?
        for elm in third_columnSoup.findAll("a"):
            # print elm.text, third_columnSoup
            item = {"code": random.upper(),
                    "name": elm.text}
            items.append(item)
The HTML is as follows:
<table cellpadding="2" cellspacing="0" id="ListResults">
<tbody>
<tr class="even">
<td colspan="4">sort results: <a href=
"/~/search/af.aspx?some=LOL&Category=All&Page=0&string=&s=a"
rel="nofollow" title=
"sort results in alphabetical order">alphabetical</a> | <strong>rank</strong> <a href="/as.asp#Rank">?</a></td>
</tr>
<tr class="even">
<th>aaa</th>
<th>vvv.</th>
<th>gdfgd</th>
<td></td>
</tr>
<tr class="odd">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/aaa.html" title=
"More info and direct link for this meaning...">AAA</a></td>
<td>TEXT TO EXTRACT HERE</td>
<td width="24"></td>
</tr>
<tr class="even">
<td align="right" width="32">******</td>
<td nowrap width="60"><a href="/someLink.html"
title="More info and direct link for this meaning...">AAA</a></td>
<td><a href=
"http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td>
<td width="24">
<a href=
"/~/search/google.aspx?q=lhfjl&f=a&cx=partner-pub-2259206618774155:1712475319&cof=FORID:10&ie=UTF-8"><img border="0"
height="21" src="/~/st/i/find2.gif" width="21"></a>
</td>
</tr>
<tr>
<td width="24"></td>
</tr>
<tr>
<td align="center" colspan="4" style="padding-top:6pt">
<b>Note:</b> We have 5575 other definitions for <strong><a href=
"http://www.ddfsadfsa.com/aaa.html">aaa</a></strong> in our
database</td>
</tr>
</tbody>
</table>
Answer 0 (score: 1)
You can use the .text attribute of the td element:
from bs4 import BeautifulSoup

html = """HERE GOES THE HTML"""
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all('tr'):
    columns = tr.find_all('td')
    # Only rows with at least three cells have a third column
    if len(columns) > 2:
        print(columns[2].text)
Prints:
TEXT TO EXTRACT HERE
TEXT TO EXTRACT HERE
Hope that helps.
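Since the question asks for both the text and the link, here is a minimal sketch that collects the two together. It assumes a trimmed-down copy of the table from the question; the function name extract_third_column and the "href" key are made up for illustration, and the "code" field from the question is left out because its source is not shown.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (an assumption for brevity).
html = """
<table id="ListResults">
  <tr><td colspan="4">sort results</td></tr>
  <tr><td>******</td><td><a href="/aaa.html">AAA</a></td>
      <td>TEXT TO EXTRACT HERE</td><td></td></tr>
  <tr><td>******</td><td><a href="/someLink.html">AAA</a></td>
      <td><a href="http://www.fdssfdfdsa.com/aaa">TEXT TO EXTRACT HERE</a></td><td></td></tr>
</table>
"""


def extract_third_column(markup):
    """Return one dict per data row with the third cell's text and optional link."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) <= 2:
            continue  # sort, header, or spacer row without a third cell
        cell = cells[2]
        link = cell.find("a")  # None when the cell holds plain text
        items.append({
            "name": cell.get_text(strip=True),
            "href": link.get("href") if link is not None else None,
        })
    return items


items = extract_third_column(html)
for item in items:
    print(item["name"], item["href"])
```

get_text() returns the same string whether the text sits directly in the td or inside an a tag, so the link lookup can stay a separate, optional step.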
Answer 1 (score: 0)
The way to do this is as follows:
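# tr is a row from the loop in the question
third_column = tr.find_all('td')[2].contents
# Join the child nodes; str() on the list itself would add brackets and quotes
third_column_text = ''.join(str(node) for node in third_column)
third_columnSoup = BeautifulSoup(third_column_text, 'html.parser')
if third_columnSoup:
    print(third_columnSoup.text)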