Question

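I am trying to extract the names of all hotels for a given country from: https://www.holidaycheck.de/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1. Since the data is split across several pages, I am trying to build a loop. Unfortunately, I have not managed to extract the number of pager pages (the highest page number) from the HTML, which would tell my loop when to stop. (I know this question has been asked and answered many times; I have read all the posts, but none of them seems to solve my problem.)

The HTML looks like this: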

<div class="main-nav-items">
<span class="prev-next"
<span>
<i class="prev-arrow icon icon-left-arrow-line"></i>
<span>previous</span>
</span>
</a>
</span>
<span class="other-page">
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>

What I need is the number in the last line of that snippet, i.e. the link text that follows the href (66 in this case).

I tried:

data = soup.find_all('a', {'class':'link'})
y=str(data)
x=re.findall("[0-9]+",y)
print(x)

But this code also returns the numbers from the href, such as 45 and 3511.

I also tried:
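One way to keep the regex approach but avoid the digits inside the href is to match only digits that sit directly between `>` and `</a>`. This is a minimal sketch on sample markup modelled on the snippet above; the second "next" link is an assumption added for illustration, and regexes over HTML are brittle compared with reading the link text through a parser.

```python
import re

# Sample markup modelled on the question's snippet; the "next" link is assumed.
html = '''
<span class="other-page">
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>
</span>
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">next</a>
'''

# Capture digits only when they form the whole text of an <a> element,
# so the 45 and 3511 inside the href attribute are never matched.
page_numbers = [int(n) for n in re.findall(r'>\s*(\d+)\s*</a>', html)]
last_page = max(page_numbers)
print(last_page)  # 66
```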

data = soup.find_all('a', {'class':'link'})
numbers=([d.text for d in data])
print(numbers)

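This works well, except that the output also includes "next" and "previous", and I did not manage to convert the output to integers so that I could take the maximum and drop "previous" and "next".

Besides that, I tried a "while" loop as described in: scraping data from unknown number of pages using beautiful soup. But somehow it did not return all hotels and skipped pages...

If anyone could give me some advice on how to solve my problem, I would be very grateful. Thank you!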
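The conversion step could look roughly like this: keep only the entries that consist entirely of digits, convert those to integers, and take the maximum. The `numbers` list here is a hypothetical stand-in for the output of `[d.text for d in data]`; the real values depend on the page.

```python
# Hypothetical link texts, standing in for [d.text for d in data]
numbers = ['previous', '1', '2', '3', '66', 'next']

# Drop non-numeric entries such as "previous" and "next", convert the rest
page_numbers = [int(text.strip()) for text in numbers if text.strip().isdigit()]
last_page = max(page_numbers)
print(last_page)  # 66
```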

Answer 1

html = '''<div class="main-nav-items">
<span class="prev-next"
<span>
<i class="prev-arrow icon icon-left-arrow-line"></i>
<span>previous</span>
</span>
</a>
</span>
<span class="other-page">
<a class="link" href="/dh/hotels-tunesien/e10cef63-45d4-3511-92f1-43df5cbd9fe1">66</a>'''

from bs4 import BeautifulSoup as BS

soup = BS(html, 'lxml')
data = soup.find_all('a', {'class':'link'})

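# Collect the text of every matching link
res = []
for i in data:
    res.append(i.text)

# Keep only the values that parse as integers
res_int = []
for i in res:
    try:
        res_int.append(int(i))
    except ValueError:
        # Link texts such as "previous" or "next" end up here
        print("current value is not a number")

# The highest page number is the last page of the pager
print(max(res_int))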
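Once the highest page number is known, a plain `for` loop over the page range avoids the skipped pages reported with the `while` approach. The sketch below is illustrative only: `scrape_all_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, and how the site encodes the page number in its URLs is not shown in the question, so the download step is left to the caller.

```python
def scrape_all_pages(last_page, fetch_page):
    """Collect hotel names from pages 1..last_page.

    fetch_page is a hypothetical callable that takes a page number and
    returns the list of hotel names found on that page.
    """
    hotels = []
    for page in range(1, last_page + 1):
        hotels.extend(fetch_page(page))
    return hotels


# Stand-in for a real download-and-parse function (assumption)
def fake_fetch(page):
    return [f"hotel-{page}-a", f"hotel-{page}-b"]


all_hotels = scrape_all_pages(3, fake_fetch)
print(len(all_hotels))  # 6
```

In real use, `fetch_page` would request the page and parse the hotel names with BeautifulSoup, and `last_page` would be the value of `max(res_int)`.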
