Extracting data from HTML5 data-* attributes with BeautifulSoup4

Date: 2015-09-02 17:54:02

Tags: python html5 python-3.x beautifulsoup bs4

I want to extract the inner text 24,000.00 from the following markup:

<span class="itm-price mrs  ">
     <span data-currency-iso="BDT">৳</span> 
     <span dir="ltr" data-price="24000">24,000.00</span> 
</span>

The page I am scraping contains many similar tags.

I tried this:

    for price in soup.find_all('span', {'class': 'itm-price'}):
        item_price = price.get('data-price')
        print(item_price)

But the output is: None

From the bs4 docs I understand that HTML5 data-* attributes should be searched for like this:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

Since I am very new to this, I still can't get any results with that approach.

3 Answers:

Answer 0: (score: 2)

You can try this:

>>> import re
>>> from bs4 import BeautifulSoup
>>> html_doc = """
... <span class="itm-price mrs  ">
...      <span data-currency-iso="BDT">৳</span> 
...      <span dir="ltr" data-price="24000">24,000.00</span> 
... </span>
... <span class="itm-price mrs  ">
...      <span data-currency-iso="BDT">৳</span> 
...      <span dir="ltr" data-price="25000">25,000.00</span> 
... </span>
... <span class="itm-price mrs  ">
...     <span data-currency-iso="BDT">৳</span> 
...     <span dir="ltr" data-price="blabla">blabla</span> 
... </span>
... """
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> soup.find("span", dir="ltr").attrs['data-price']
'24000'

# You can also loop over every span whose data-price is numeric

>>> for price_span in soup.find_all("span", attrs={"dir": "ltr", "data-price": re.compile(r"\d+")}):
...     print(price_span.attrs.get("data-price", None))

# output
24000
25000
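
For what it's worth, the original loop printed None because the outer `itm-price` span has no `data-price` attribute of its own; the attribute sits on the nested span. The following is only a minimal sketch of starting from the wrapper and searching inside it. The variable names (`outer`, `inner`, `results`) are made up for illustration, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

html_doc = """
<span class="itm-price mrs  ">
     <span data-currency-iso="BDT">৳</span>
     <span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""

soup = BeautifulSoup(html_doc, "html.parser")

results = []
for outer in soup.find_all("span", class_="itm-price"):
    # The wrapper itself carries no data-price attribute, so this is None.
    outer_value = outer.get("data-price")
    # Search inside the wrapper for the span that has the attribute at all.
    inner = outer.find("span", attrs={"data-price": True})
    if inner is not None:
        # Collect both the raw attribute value and the displayed text.
        results.append((inner["data-price"], inner.get_text(strip=True)))

print(outer_value)
print(results)
```

Passing `True` as the attribute value matches any tag that has the attribute, whatever its value, which may be enough if you don't need the regular-expression filter.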

Answer 1: (score: 2)

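Why look for the surrounding <span> when you can access the content you want directly? Also, you can use keyword arguments (although I understand why you would be reluctant to try that with the class attribute, since class is a Python keyword).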

The get_text() method extracts the content between a pair of matching tags, so you end up with a very simple program:

# coding=utf-8
data = u"""\
<span class="itm-price mrs  ">
     <span data-currency-iso="BDT">৳</span>
     <span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""

import bs4
# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = bs4.BeautifulSoup(data, 'html.parser')
for price in soup.find_all('span', dir="ltr"):
    print(price.get_text())
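
As a rough illustration of the keyword-argument point: bs4 accepts `class_` in place of the reserved word `class`, while a name like `data-price` is not a valid Python identifier and so has to go through the `attrs` dictionary. A CSS selector via `select()` is another option. The sketch below assumes `html.parser`; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

data = """\
<span class="itm-price mrs  ">
     <span data-currency-iso="BDT">৳</span>
     <span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""

soup = BeautifulSoup(data, "html.parser")

# "class" is a Python keyword, so bs4 takes class_ as the keyword argument.
wrappers = soup.find_all("span", class_="itm-price")

# "data-price" contains a hyphen, so it must be passed through attrs.
by_attrs = [tag.get_text() for tag in soup.find_all("span", attrs={"data-price": True})]

# The same lookup written as a CSS selector.
by_css = [tag.get_text() for tag in soup.select("span.itm-price span[data-price]")]

print(len(wrappers), by_attrs, by_css)
```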

Answer 2: (score: 0)

Use the find method:

>>> from bs4 import BeautifulSoup
>>> url = """<span class="itm-price mrs  "><span data-currency-iso="BDT">৳</span><span dir="ltr" data-price="24000">24,000.00</span></span>"""
>>> soup = BeautifulSoup(url, 'html.parser')  # parse the markup before searching it
>>> soup.find("span", dir="ltr").string
'24,000.00'
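
One caveat that may be worth knowing: `.string` only returns a value when a tag has exactly one child, so it works on the inner span but gives None on the wrapper, where `get_text()` still returns the combined text. A small sketch under the same assumptions as above (`html.parser`, illustrative variable names):

```python
from bs4 import BeautifulSoup

markup = (
    '<span class="itm-price mrs  ">'
    '<span data-currency-iso="BDT">৳</span>'
    '<span dir="ltr" data-price="24000">24,000.00</span>'
    '</span>'
)
soup = BeautifulSoup(markup, "html.parser")

inner = soup.find("span", dir="ltr")
outer = soup.find("span", class_="itm-price")

inner_string = inner.string    # one text child, so the text itself
outer_string = outer.string    # two child tags, so None
outer_text = outer.get_text()  # text of all descendants, concatenated

print(repr(inner_string), outer_string, outer_text)
```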