Web Scraping: unable to scroll down a webpage using the Selenium WebDriver

Asked: 2017-04-08 07:36:06

标签: python selenium-webdriver web-scraping

I am trying to extract all the links from a forum (https://www.pakwheels.com/forums/c/travel-n-tours), but my scraper stops after scrolling down once.

from bs4 import BeautifulSoup

sourceUrl='https://www.pakwheels.com/forums/c/travel-n-tours'

# Scrolling to the bottom of the page
# (source: http://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python)

from selenium import webdriver
import time

chrome_path = r"C:\Users\Shani\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get(sourceUrl)

# Scroll to the bottom and record the page height
updatedLenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
scrollComplete = False
while not scrollComplete:
    currentLenOfPage = updatedLenOfPage
    updatedLenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    print('Scrolling down')
    time.sleep(5)
    # Stop once the page height no longer grows
    if currentLenOfPage == updatedLenOfPage:
        scrollComplete = True
time.sleep(10)
pageSource = driver.page_source

# Getting links
soup = BeautifulSoup(pageSource, 'lxml')
# print(soup)

blogUrls=[]
for url in soup.find_all('a'):
    if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
        blogUrls.append(url.get('href'))
        print(url.get('href'))       
print(len(blogUrls))

It gives the following error:

Traceback (most recent call last):
  File "D:\LiclipsWorkSpace\NLKTLib\Scrapping\scrolling.py", line 32, in <module>
    if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
AttributeError: 'NoneType' object has no attribute 'find'

Please help.

1 answer:

Answer 0 (score: 1)

You don't need Selenium; you can get all the links from the site's JSON responses. The code below fetches the URLs from the first 5 pages (to fetch every page, simply change 5 to the number of the last page).

import requests

for i in range(0, 5):
    # Each page of the topic list is available as JSON
    r = requests.get(
        'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(i)).json()
    topics = r['topic_list']['topics']
    for topic in topics:
        # Build each topic URL from its slug and id
        print('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
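
If you don't know the number of the last page in advance, you can keep requesting pages until the topic list runs out. This is a minimal sketch assuming the endpoint simply returns an empty topics list once you are past the last page (an assumption, not verified against the live site):

import requests

baseUrl = 'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'
blogUrls = []
page = 0
while True:
    r = requests.get(baseUrl.format(page)).json()
    topics = r['topic_list']['topics']
    if not topics:  # assumption: an empty list marks the end of the pages
        break
    for topic in topics:
        blogUrls.append('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
    page += 1

print(len(blogUrls))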
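
As for the AttributeError in the original script: url.get('href') returns None for <a> tags that have no href attribute, and calling .find() on None raises exactly that error. If you do want to keep the Selenium/BeautifulSoup approach, skip those anchors first, for example:

blogUrls = []
for url in soup.find_all('a'):
    href = url.get('href')
    if href is None:  # skip anchors without an href attribute
        continue
    if ('/forums/t/' in href
            and 'about-the-travel-n-tours-category' not in href
            and '/forums/t/topic/' not in href):
        blogUrls.append(href)
print(len(blogUrls))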