Question

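I am trying to scrape data from a web page and then move on to the next page by extracting the href that points to it.

However, in this case the tag that should hold the link to the next page has href='#next'. When I inspect this element in Chrome and hover over the "#next" text, it appears to be a hyperlink and shows me the full href.

I suspect the href is lost once the request is made and the response is converted to text: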

r = requests.get(url)

s = BeautifulSoup(r.text)

I use the findAll() function to get the element I want:

s.findAll('a', class_='pagenav')[5]

Result:

<a href="#next" class="pagenav" title="next page" onclick="javascript:
document.pageForm.limitstart.value=20; document.pageForm.submit();return false;">
Next&nbsp;&gt;</a>

How can I get the href in this case?

Here is the link to the website:

https://associatedrealtorsaruba.com/index.php?option=com_ezrealty&Itemid=11&task=results&cnid=0&custom7=&custom8=&parking=&type=0&cid=0&stid=0&locid=0&minprice=&maxprice=&minbed=&maxbed=&min_squarefeet=&max_squarefeet=&bathrooms=&sold=0&lug=0&featured=0&custom4=&custom5=&custom6=&postcode=&radius=&direction=DEFAULT&submit=Search

Answer 1

If you use Selenium, you can use it to find <a class="pagenav"> or <a title="next page"> and then call .click() to load the next page, so you don't need the href at all.

import selenium.webdriver
from selenium.webdriver.common.by import By

url = 'https://associatedrealtorsaruba.com/index.php?option=com_ezrealty&Itemid=11&task=results&cnid=0&custom7=&custom8=&parking=&type=0&cid=0&stid=0&locid=0&minprice=&maxprice=&minbed=&maxbed=&min_squarefeet=&max_squarefeet=&bathrooms=&sold=0&lug=0&featured=0&custom4=&custom5=&custom6=&postcode=&radius=&direction=DEFAULT&submit=Search'

driver = selenium.webdriver.Firefox()
driver.get(url)

# find link to next page
# (find_element_by_xpath was removed in Selenium 4.3; use find_element with By.XPATH)
next_page = driver.find_element(By.XPATH, '//a[@title="next page"]')

# click link to load next page
next_page.click()

BTW: if you load pages 1, 2, and 3 manually and compare their URLs in the browser, you will see that the only difference between them is

for page 1: &limitstart=0 
for page 2: &limitstart=20 
for page 3: &limitstart=40

This gives you a way to load the next page without getting the href: take the original URL and append &limitstart= with the correct value to load each page.
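The offset for the next page is also visible in the onclick handler shown in the question (limitstart.value=20), so it can be read from there instead of being hard-coded. The following is only a sketch: limitstart_from_onclick is a hypothetical helper name, and the regular expression assumes the handler keeps the form shown in the question.

```python
import re

# onclick value copied from the "next page" link shown in the question.
onclick = ('javascript:\n'
           'document.pageForm.limitstart.value=20; '
           'document.pageForm.submit();return false;')


def limitstart_from_onclick(onclick_text):
    """Return the limitstart offset assigned in an onclick handler, or None.

    Hypothetical helper; assumes the handler contains 'limitstart.value=<number>'.
    """
    match = re.search(r'limitstart\.value\s*=\s*(\d+)', onclick_text)
    return int(match.group(1)) if match else None


print(limitstart_from_onclick(onclick))  # 20
```

With the BeautifulSoup object s from the question, the attribute text could be obtained with something like s.find('a', title='next page')['onclick'] and passed to this function.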

If you want 50 items per page, you have to use &limit=50, and then &limitstart takes the values 0, 50, 100, etc.

EDIT:

With requests
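The relationship between the page number, &limit= and &limitstart= can be sketched as a small URL builder. This is an illustration, not part of the original answer: page_url is a hypothetical helper, the base URL is shortened for readability, and the default of 20 items per page is inferred from the offsets 0, 20, 40 listed above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_url(base_url, page, per_page=20):
    """Return base_url with limit/limitstart set for a zero-based page index.

    Hypothetical helper; any existing limit/limitstart parameters are replaced.
    """
    parts = urlsplit(base_url)
    # keep_blank_values preserves empty parameters such as "custom7="
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in ('limit', 'limitstart')]
    query += [('limit', str(per_page)), ('limitstart', str(page * per_page))]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened version of the search URL, for illustration only.
base = 'https://associatedrealtorsaruba.com/index.php?option=com_ezrealty&task=results'
for page in range(3):
    print(page_url(base, page, per_page=50))
```

Rebuilding the query string instead of concatenating avoids ending up with two limitstart parameters when the starting URL already contains one.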

import requests
from bs4 import BeautifulSoup as BS

url = 'https://associatedrealtorsaruba.com/index.php?option=com_ezrealty&Itemid=11&task=results&cnid=0&custom7=&custom8=&parking=&type=0&cid=0&stid=0&locid=0&minprice=&maxprice=&minbed=&maxbed=&min_squarefeet=&max_squarefeet=&bathrooms=&sold=0&lug=0&featured=0&custom4=&custom5=&custom6=&postcode=&radius=&direction=DEFAULT&submit=Search'

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0', # need full UA
}

for x in (0, 20, 40):
    r = requests.get(url + '&limitstart={}'.format(x), headers=headers)
    print('\n---', x, '---\n')

    soup = BS(r.text, 'html.parser')

    all_items = soup.find_all('span', {'class': 'h3'})
    for item in all_items:
        print(item.get_text(strip=True))
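The loop above hard-codes three offsets. If the total number of pages is unknown, one possible approach is to keep increasing the offset until a page yields no items. The sketch below assumes (this is not verified against the site) that a request past the last page returns no matching elements; iter_all_items and fetch_items are hypothetical names, and max_pages is a safety cap in case that assumption is wrong.

```python
def iter_all_items(fetch_items, step=20, max_pages=100):
    """Yield items page by page until a page comes back empty.

    fetch_items(offset) is a caller-supplied function returning the list of
    items for the given limitstart offset.
    """
    for page in range(max_pages):
        items = fetch_items(page * step)
        if not items:
            break
        yield from items


# Stand-in for real requests: fake pages keyed by limitstart offset.
fake_pages = {0: ['A', 'B'], 20: ['C'], 40: []}


def fake_fetch(offset):
    return fake_pages.get(offset, [])


print(list(iter_all_items(fake_fetch)))  # ['A', 'B', 'C']
```

With the requests code above, fetch_items could request url + '&limitstart={}'.format(offset) and return the text of every span with class h3.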

With Selenium

import selenium.webdriver
from selenium.webdriver.common.by import By

url = 'https://associatedrealtorsaruba.com/index.php?option=com_ezrealty&Itemid=11&task=results&cnid=0&custom7=&custom8=&parking=&type=0&cid=0&stid=0&locid=0&minprice=&maxprice=&minbed=&maxbed=&min_squarefeet=&max_squarefeet=&bathrooms=&sold=0&lug=0&featured=0&custom4=&custom5=&custom6=&postcode=&radius=&direction=DEFAULT&submit=Search'

driver = selenium.webdriver.Firefox()
driver.get(url)

while True:

    # print the text of every item on the current page
    # (the find_element(s)_by_* helpers were removed in Selenium 4.3)
    all_items = driver.find_elements(By.XPATH, '//span[@class="h3"]')
    for item in all_items:
        print(item.text)

    try:
        # find link to next page
        next_page = driver.find_element(By.XPATH, '//a[@title="next page"]')

        # click link to load next page
        next_page.click()
    except Exception as ex:
        # stop when the link cannot be found or clicked (e.g. on the last page)
        print('ex:', ex)
        break
