How do I scrape a <div> tag with Python?

Time: 2020-06-23 00:42:52

Tags: python html

I am trying to scrape data from a listings website that has the following HTML structure:

 <div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1" data-block="21st Floor" data-building_size="31" data-category="condominium" data-condominiumname="Twin Lakes Countrywoods" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>
      <div class="ListingCell-TitleWrapper">
       <h3 class="ListingCell-KeyInfo-title" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
        <a class="js-listing-link" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
         Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay
        </a>
       </h3>
       <div class="ListingCell-KeyInfo-address ellipsis">
        <a class="js-listing-link ellipsis" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
         <span class="icon-pin">
         </span>
         <span>
          Tagaytay Hi-Way

                                Dayap Itaas, Laurel
         </span>
        </a>
       </div>

What I want to get is

I tried scraping it with Python and BeautifulSoup:

details = container.find('div',class_="ListingCell-AllInfo ListingUnit").text if container.find('div',class_="ListingCell-AllInfo ListingUnit") else "-"

Every listing returns "-". Complete newbie here!

2 answers:

Answer 0: (score: 0)

You can use Beautiful Soup; it has always worked for me.

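 from urllib.request import Request, urlopen
 from bs4 import BeautifulSoup

 # Send a browser-like User-Agent; some sites reject the default one
 req = Request("put your url here", headers={'User-Agent': 'Mozilla/5.0'})
 webpage = urlopen(req).read()
 soup = BeautifulSoup(webpage, "html.parser")  # name the parser explicitly

 title = soup.find_all('tag you want to scrape', class_='class of that tag')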

See this link for more information: https://pypi.org/project/beautifulsoup4/
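Applied to the markup in the question, that approach might look like the sketch below. It assumes the page HTML is already available as a string (the `html` variable here is a trimmed copy of the question's snippet, not a live download) and that the values of interest are the `data-*` attributes plus the title; `listings` is just an illustrative name. Note that `find` only searches the descendants of the tag it is called on, so if `container` in the question is already the `ListingCell-AllInfo` div, looking for that same div inside it would find nothing, which is one possible reason for the "-" result.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the listing markup from the question (assumed input).
html = """
<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1"
     data-block="21st Floor" data-category="condominium" data-price="6000000">
  <h3 class="ListingCell-KeyInfo-title">
    <a class="js-listing-link" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html">
      Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay
    </a>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

listings = []
# select() takes a CSS selector, so both classes can be required at once.
for div in soup.select("div.ListingCell-AllInfo.ListingUnit"):
    # div.attrs is a dict of the tag's attributes; keep only the data-* ones.
    data = {k: v for k, v in div.attrs.items() if k.startswith("data-")}
    title_tag = div.find("h3", class_="ListingCell-KeyInfo-title")
    data["title"] = title_tag.get_text(strip=True) if title_tag else "-"
    listings.append(data)

print(listings)
```

Each entry in `listings` is a plain dict, so a single value can be read with, for example, `listings[0]["data-price"]`.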

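Answer 1: (score: 0)

Hi there! You can solve this with regular expressions.

I've added some comments to the solution, but to learn more, take a look at the official documentation or read this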

import re # regular expression module

txt = """insert your html here"""

# we create a regex pattern called p1 that will match a string starting with
# <div class="ListingCell-AllInfo ListingUnit"
# followed by any character other than '>' found 0 or more times
# and the string must end with '>'
p1 = re.compile(r'<div class="ListingCell-AllInfo ListingUnit"[^>]*>')

# findall returns a list of the strings in txt that match the pattern p1
ls = p1.findall(txt)

# now, what you want is the data, so we can create another pattern where the word
# "data" will be found

# match a string starting with "data-", followed by 1 or more word characters,
# then '=', then a value wrapped in double or single quotes (quoted values
# may contain spaces, e.g. data-block="21st Floor")

p2 = re.compile(r'''data-\w+=(?:"[^"]*"|'[^']*')''')
data = p2.findall(ls[0])

print(data)

Note: don't be intimidated by the funky symbols; they look worse than they really are.
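Regular expressions can become brittle on HTML, because attribute values may contain spaces, quotes, or other characters a pattern did not anticipate. As a dependency-free alternative, here is a minimal sketch using the standard library's `html.parser`. The class name `ListingParser` and the shortened `txt` sample are illustrative assumptions, not part of the original answer.

```python
import json
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the data-* attributes of every div carrying both listing classes."""

    def __init__(self):
        super().__init__()
        self.listings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and {"ListingCell-AllInfo", "ListingUnit"} <= set(classes):
            self.listings.append(
                {k: v for k, v in attr_map.items() if k.startswith("data-")}
            )


# Shortened version of the opening tag from the question (assumed input).
txt = """<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-block="21st Floor" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>"""

parser = ListingParser()
parser.feed(txt)
parser.close()

print(parser.listings)

# data-subcategories holds a JSON array, so it can be decoded into a Python list.
subcategories = json.loads(parser.listings[0]["data-subcategories"])
print(subcategories)
```

Because the parser hands over attribute values already unquoted, values such as `21st Floor` arrive whole and need no extra cleanup.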