I have a large HTML menu file stored as a plain file, and I need to get the highest price for each menu item. Here is a sample of the menu file:
### File Name: "menu" (All types ".") ###
</div>
<div class="menu-item-prices">
<table>
<tr>
<td class="menu-item-price-amount">
10
</td>
<td class="menu-item-price-amount">
14
</td>
</tr>
</div>
</div>
<div class="menu-item-prices">
<table>
<tr>
<td class="menu-item-price-amount">
100
</td>
<td class="menu-item-price-amount">
1
</td>
</tr>
</div>
I need my program to return a list with the highest price of each menu item, i.e. maxprices = ['14','100'] for this sample. I tried the following code in Python:
#!/user/bin/python
from lxml import html
from os.path import join, dirname, realpath
from lxml.etree import XPath

def main():
    """ Drive function """
    fpath = join(dirname(realpath(__file__)), 'menu')
    hfile = open(fpath) # open html file
    tree = html.fromstring(hfile.read())
    prices_path = XPath('//*[@class="menu-item-prices"]/table/tr')
    maxprices = []
    for p in prices_path(tree):
        prices = p.xpath('//td/text()')
        prices = [el.strip() for el in prices]
        maxprice = max(prices)
        maxprices.append(maxprice)
    print maxprices

if __name__ == '__main__':
    main()
I also tried
prices = tree.xpath('//*[@class="menu-item-prices"]'
'//tr[not(../tr/td > td)]/text()')
prices = [el.strip() for el in prices]
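instead of the loop above. Neither version returns the highest price for each menu item. How can I modify my code to get these prices correctly? Thanks.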
Answer 0 (score: 1)
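There is at least one problem: you are comparing strings, but you need to convert the prices to float and then take the maximum for each table row. The inner XPath should also be relative (`.//td` rather than `//td`), otherwise it matches every cell in the document instead of only the cells of the current row.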
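To see why comparing strings is a problem, here is a minimal sketch. The values are made up for illustration and are not taken from the menu file. Strings are compared character by character, so '9' ranks above '10' and '100':

```python
# Hypothetical price strings, chosen to expose the problem.
prices = ['9', '10', '100']

# Lexicographic comparison: '9' > '1', so '9' wins.
max_as_text = max(prices)

# Numeric comparison after converting each price to float.
max_as_number = max(float(p) for p in prices)

print(max_as_text)    # 9
print(max_as_number)  # 100.0
```

With the sample menu, the string comparison happens to give the right answer within each row, so the mistake only shows up on other data.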
Complete example:
# -*- coding: utf-8 -*-
from lxml.html import fromstring
data = """
<div>
<div class="menu-item-prices">
<table>
<tr>
<td class="menu-item-price-amount">
10
</td>
<td class="menu-item-price-amount">
14
</td>
</tr>
</table>
</div>
<div class="menu-item-prices">
<table>
<tr>
<td class="menu-item-price-amount">
100
</td>
<td class="menu-item-price-amount">
1
</td>
</tr>
</table>
</div>
</div>
"""
tree = fromstring(data)
for item in tree.xpath("//div[@class='menu-item-prices']/table/tr"):
    prices = [float(price.strip()) for price in item.xpath(".//td[@class='menu-item-price-amount']/text()")]
    print(max(prices))
Prints:
14.0
100.0
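If you want the list of strings from the question (maxprices = ['14','100']) rather than printed floats, one option is to compare numerically with `key=float` while keeping the original text. This is only a sketch: `max_price_per_row` is a hypothetical helper, and the `rows` list stands in for the text nodes that the relative XPath returns for each table row.

```python
def max_price_per_row(rows):
    """Return the highest price of each row, kept as the original text."""
    maxprices = []
    for cells in rows:
        # Strip whitespace left over from the HTML layout; skip empty cells.
        prices = [c.strip() for c in cells if c.strip()]
        if prices:
            # Compare numerically, but keep the string form.
            maxprices.append(max(prices, key=float))
    return maxprices


# Hypothetical stand-in for the text nodes of the two sample rows.
rows = [['\n10\n', '\n14\n'], ['\n100\n', '\n1\n']]
maxprices = max_price_per_row(rows)
print(maxprices)  # ['14', '100']
```

Rows without any price text are skipped instead of raising a ValueError from `max()` on an empty list.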