I am trying to scrape data from a listings website with the following HTML structure:
<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1" data-block="21st Floor" data-building_size="31" data-category="condominium" data-condominiumname="Twin Lakes Countrywoods" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>
<div class="ListingCell-TitleWrapper">
<h3 class="ListingCell-KeyInfo-title" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<a class="js-listing-link" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay
</a>
</h3>
<div class="ListingCell-KeyInfo-address ellipsis">
<a class="js-listing-link ellipsis" data-position="8" data-sku="CD5E17CED0347ECPH" href="https://www.lamudi.com.ph/twin-lakes-countrywoods-1br-unit-for-sale-tagaytay-2.html" target="_blank" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay">
<span class="icon-pin">
</span>
<span>
Tagaytay Hi-Way
Dayap Itaas, Laurel
</span>
</a>
</div>
What I want to get is
I tried scraping it with Python and BeautifulSoup:
details = container.find('div',class_="ListingCell-AllInfo ListingUnit").text if container.find('div',class_="ListingCell-AllInfo ListingUnit") else "-"
Every listing returns "-". Complete newbie here!
Answer 0 (score: 0)
You can use Beautiful Soup; it has always worked for me.
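from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
req = Request("put your url here", headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(webpage, "html.parser")
title = soup.find_all('tag you want to scrape', class_='class of that tag')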
Visit this link for more information: https://pypi.org/project/beautifulsoup4/
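As a rough sketch of how this applies to the markup in the question: the values on the listing div live in its data-* attributes rather than in its text, so they can be read from the tag's attrs. The HTML string below is a shortened copy of the question's markup, listing_data is a hypothetical helper name, and the sketch assumes the page HTML has already been downloaded.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumption: real pages hold many such divs)
html = """
<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-bedrooms="1" data-block="21st Floor" data-category="condominium" data-price="6000000">
<h3 class="ListingCell-KeyInfo-title" title="Twin Lakes Countrywoods 1BR Unit for Sale, Tagaytay"></h3>
</div>
"""


def listing_data(tag):
    """Return the tag's data-* attributes as a dict, with the 'data-' prefix removed."""
    prefix = "data-"
    return {k[len(prefix):]: v for k, v in tag.attrs.items() if k.startswith(prefix)}


soup = BeautifulSoup(html, "html.parser")
# The CSS selector matches div elements that carry both classes
listings = [listing_data(div) for div in soup.select("div.ListingCell-AllInfo.ListingUnit")]
print(listings)
```

Each entry in listings is a plain dict, so a single field can be read with something like listings[0]["price"].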
Answer 1 (score: 0)
Hi! You can solve this with a regular expression.
I have put some comments in the solution, but to learn more, take a look at the official documentation or read this
import re # regular expression module
txt = """insert your html here"""
# we create a regex pattern called p1 that will match a string starting with
# <div class="ListingCell-AllInfo ListingUnit"
# followed by any character other than '>' found 0 or more times
# and the string must end with '>'
p1 = re.compile(r'<div class="ListingCell-AllInfo ListingUnit"[^>]*>')
# findall returns a list of the strings in txt that match the pattern p1
ls = p1.findall(txt)
# now, what you want is the data, so we can create another pattern where the word
# "data" will be found
# match a string starting with data, followed by '-', then by 1 or more word characters
# or '-', then by '=', then by a quoted value (everything up to the closing quote),
# so that values containing a space, such as "21st Floor", are kept whole
p2 = re.compile(r'(data-[\w-]+=(?:"[^"]*"|\'[^\']*\'))')
data = p2.findall(ls[0])
print(data)
Note: don't be intimidated by those funky symbols; they look worse than they really are.
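If name/value pairs are more convenient than raw strings, the same idea can be taken one step further. This is only a sketch under assumptions: the html string is a shortened copy of the question's opening tag, and tag_pattern, attr_pattern and records are illustrative names rather than part of the answer above.

```python
import re

# Shortened copy of the opening tag from the question
html = """<div class="ListingCell-AllInfo ListingUnit" data-bathrooms="1" data-block="21st Floor" data-price="6000000" data-subcategories='["condominium","single-bedroom"]'>"""

# Opening tag of a listing div: everything up to the first '>'
tag_pattern = re.compile(r'<div class="ListingCell-AllInfo ListingUnit"[^>]*>')
# Group 1: attribute name without 'data-'; group 2: the quote character;
# group 3: the value, read lazily up to the matching closing quote (\2)
attr_pattern = re.compile(r'data-([\w-]+)=(["\'])(.*?)\2')

records = []
for tag in tag_pattern.findall(html):
    # One dict per listing div, mapping attribute name to its string value
    records.append({name: value for name, _quote, value in attr_pattern.findall(tag)})

print(records)
```

Regular expressions work here because the attributes sit in a single, predictable opening tag; for markup that varies more, an HTML parser is usually the more robust choice.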