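I want to scrape some price data from a bunch of HTML tables. The tables contain all sorts of prices, and of course the table data tags don't carry anything useful for telling them apart.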
<div id="item-price-data">
<table>
<tbody>
<tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr>
<tr>
<td class="some-class">Member Price:</td>
<td class="another-class">$90.00</td>
</tr>
<tr>
<td class="some-class">Sale Price:</td>
<td class="another-class">$80.00</td>
</tr>
<tr>
<td class="some-class">You save:</td>
<td class="another-class">$20.00</td>
</tr>
</tbody>
</table>
</div>
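The only prices I care about are the ones paired with the "Normal Price" element.
What I'd like to do is scan the table's descendants, find the <td> tag containing that text, and then pull the text from its sibling.
The problem I'm running into is that in BeautifulSoup the descendants attribute returns a list of NavigableString objects rather than Tag objects.
So, if I do this: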
from bs4 import BeautifulSoup
from urllib import request
html = request.urlopen(url)
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'id': 'item-price-data'})
table_data = div.find_all('td')
for element in table_data:
    if element.get_text() == 'Normal Price:':
        price = element.next_sibling
        print(price)
I get nothing. Is there a simple way to get the string value?
Answer 0 (score: 1)
You can use the find_next() method, possibly together with a regular expression:
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """<div id="item-price-data">
... <table>
... <tbody>
... <tr>
... <td class="some-class">Normal Price:</td>
... <td class="another-class">$100.00</td>
... </tr>
... <tr>
... <td class="some-class">Member Price:</td>
... <td class="another-class">$90.00</td>
... </tr>
... <tr>
... <td class="some-class">Sale Price:</td>
... <td class="another-class">$80.00</td>
... </tr>
... <tr>
... <td class="some-class">You save:</td>
... <td class="another-class">$20.00</td>
... </tr>
... </tbody>
... </table>
... </div>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', {'id': 'item-price-data'})
>>> for element in div.find_all('td', text=re.compile('Normal Price')):
...     price = element.find_next('td')
...     print(price)
...
<td class="another-class">$100.00</td>
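The demo above prints the whole <td> tag, while the question asks for the string value. A minimal sketch of one way to get it is to call get_text() on the tag that find_next() returns. The shortened HTML, the 'html.parser' backend (chosen so the sketch does not depend on lxml), and the string= keyword (the newer spelling of text= in recent bs4 releases) are choices made for this illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustrative only).
html = """<div id="item-price-data">
<table><tbody>
<tr><td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td></tr>
<tr><td class="some-class">Sale Price:</td>
<td class="another-class">$80.00</td></tr>
</tbody></table>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "item-price-data"})

# Find the label cell, then the next <td> in document order.
label = div.find("td", string=re.compile("Normal Price"))
price_cell = label.find_next("td")

# get_text(strip=True) returns the cell's text without surrounding whitespace.
price_text = price_cell.get_text(strip=True)
print(price_text)  # $100.00
```

If the label might be missing on some pages, find() returns None, so a check before calling find_next() would be needed in real scraping code.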
If you'd rather not bring regular expressions into this, the following will work for you.
>>> table_data = div.find_all('td')
>>> for element in table_data:
...     if 'Normal Price' in element.get_text():
...         price = element.find_next('td')
...         print(price)
...
<td class="another-class">$100.00</td>
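As for why the original attempt printed nothing: the likely cause is that next_sibling returns the very next node in the tree, and in this markup that node is the newline between the two <td> tags, which is a NavigableString rather than a Tag. The sketch below illustrates this on a trimmed copy of the table and shows find_next_sibling('td') as an alternative that skips text nodes. The trimmed HTML and the 'html.parser' backend are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# One row of the table from the question, with the same line breaks between cells.
html = """<table>
<tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("td", class_="some-class")

# next_sibling is the whitespace text node between the two cells.
raw_sibling = label.next_sibling
print(repr(raw_sibling))
print(isinstance(raw_sibling, NavigableString))

# find_next_sibling('td') skips text nodes and returns the next <td> tag.
price_cell = label.find_next_sibling("td")
price_text = price_cell.get_text(strip=True)
print(price_text)
```

Unlike find_next(), find_next_sibling() only looks within the same parent, so it stays inside the current <tr> and cannot wander into a later row.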