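I have been struggling with how to get a list of links from multiple pages of a forum that serves its content through SERP pages. My code works well (my goal is to dump every conversation in the search results to PDF), but it fails past the first page of threads. When I did a quick page-source comparison between the two URLs, I could see the problem. The second URL appends "#serp=2" and loads correctly in the browser, but its page source contains the same links as the first page.
Here is my code below. Any suggestions on how to pull the results from the subsequent pages, or is there a way to pull all the results at once?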
#! python3
# getE2EResults.py - Opens all E2E threads and saves them to a file.
import requests, sys, webbrowser, bs4, pdfkit
from pypac import PACSession
session = PACSession()
path_wkthmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
site_list = []
print('Searching...') # display text while downloading
res = session.get('http://e2e.ti.com/search?q=' + ''.join(sys.argv[1:]) + '&category=forum&date=&customdaterange=0&startdate=&enddate=')
res.raise_for_status()
# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text,'lxml')
# Find the number of pages in search results
mydivs = soup.findAll("div", {"class": "search-view-by-sort"})
string1 = mydivs[0].text
numberOfResults = [int(s) for s in string1.split() if s.isdigit()]
numberOfPages = (numberOfResults[0]//10)
if (numberOfResults[0]%10 > 0):
    numberOfPages += 1
print(str(numberOfPages) + ' pages of results')
###########################################
# Find all 10 post links for the first page, add to site list
linkElems = soup.select('.name a')
numOpen = min(10, len(linkElems))
for i in range(numOpen):
    res1 = session.get(linkElems[i].get('href'))
    res1.raise_for_status()
    site_list.append(linkElems[i].get('href'))
    # soup1 = bs4.BeautifulSoup(res1.text)
    # webbrowser.open(linkElems[i].get('href'))
# Repeat for all pages in search results
if (numberOfPages > 1):
    for n in range(2,(numberOfPages+1)):
        res = session.get('http://e2e.ti.com/search?q=' + ''.join(sys.argv[1:]) + '&category=forum&date=&customdaterange=0&startdate=&enddate=#serp='+str(n))
        #print('http://e2e.ti.com/search?q=' + ''.join(sys.argv[1:]) + '&category=forum&date=&customdaterange=0&startdate=&enddate=#serp='+str(n))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text,'lxml')
        linkElems = soup.select('.name a')
        numOpen = min(10, len(linkElems))
        for i in range(numOpen):
            res1 = session.get(linkElems[i].get('href'))
            res1.raise_for_status()
            site_list.append(linkElems[i].get('href'))
counter = 1
for item in site_list:
    print(str(counter) + ' ' + item)
    counter += 1  # advance the index so each link gets its own number
'''
# Create pdf of all Results
#print(site_list)
counter = 1
for item in site_list:
    pdfkit.from_url(item, 'out'+str(counter)+'.pdf', configuration=config)
    counter += 1
#pdfkit.from_url(site_list, ''.join(sys.argv[1:])+'.pdf', configuration=config)
'''
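Answer 0 (score: 0)
The simplest approach is to look for the next-page URL and use it for the next request. If that button is missing, you have reached the last page: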
from bs4 import BeautifulSoup
import requests
def get_page_urls(html):
    soup = BeautifulSoup(html, 'lxml')
    # Find the number of pages in search results
    number_of_pages = int(soup.find(class_='search-view-by-sort').span.text.split(' ')[2].replace(',', '')) // 10
    # Find the URL for the next page
    next_url = soup.find('a', class_='next')
    if next_url:
        next_url = base_url + next_url['href']
    # Display/store all of the links
    for link in soup.select('.name a'):
        site_list.append(link['href'])
        print(' ', link['href'])
    return number_of_pages, next_url
site_list = []
page_number = 1
jar = requests.cookies.RequestsCookieJar()
base_url = 'http://e2e.ti.com'
search = 'Beaglebone black'
url = '{}/search?q={}&category=forum&date=&customdaterange=0&startdate=&enddate='.format(base_url, search)
print("Page 1")
res = requests.get(url, cookies=jar)
number_of_pages, url = get_page_urls(res.text)
while url:
    page_number += 1
    print("Page {} of {}".format(page_number, number_of_pages))
    res = requests.get(url, cookies=jar)
    _, url = get_page_urls(res.text)
This code keeps requesting pages and storing the URLs until every page has been received. Note that the search term is hard-coded for testing.
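To drop the hard-coded search, the URL could be built from a list of terms (for example `sys.argv[1:]`, as in the question). The following is only a sketch: `build_search_url` is a hypothetical helper, not part of the answer's code, and it assumes the site accepts the same query parameters shown in the URLs above, with spaces encoded as `+`.

```python
from urllib.parse import urlencode

BASE_URL = "http://e2e.ti.com"


def build_search_url(terms, base_url=BASE_URL):
    """Build the forum search URL for a list of search terms.

    Hypothetical helper: the parameter names are copied from the
    URLs used in the question and are assumed to still be valid.
    """
    params = {
        "q": " ".join(terms),  # join words with spaces, not ''
        "category": "forum",
        "date": "",
        "customdaterange": "0",
        "startdate": "",
        "enddate": "",
    }
    # urlencode percent-encodes values and turns spaces into '+'
    return "{}/search?{}".format(base_url, urlencode(params))


print(build_search_url(["Beaglebone", "black"]))
```

Joining with a space matters here: the question's `''.join(sys.argv[1:])` would merge "Beaglebone" and "black" into a single word before searching.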
Running the script with the hard-coded search gives output like the following:
Page 1
http://e2e.ti.com/support/arm/sitara_arm/f/791/t/270719?tisearch=e2e-sitesearch&keymatch=Beaglebone black
http://e2e.ti.com/support/embedded/linux/f/354/t/483988?tisearch=e2e-sitesearch&keymatch=Beaglebone black
..
..
Page 2 of 308
http://e2e.ti.com/support/embedded/starterware/f/790/t/301790?tisearch=e2e-sitesearch&keymatch=Beaglebone black
http://e2e.ti.com/support/arm/sitara_arm/f/791/t/501015?tisearch=e2e-sitesearch&keymatch=Beaglebone black
..
..
Page 3 of 308
http://e2e.ti.com/support/embedded/starterware/f/790/p/285959/1050634?tisearch=e2e-sitesearch&keymatch=Beaglebone black#1050634
..
..
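As for why the `#serp=2` URL in the question returned the first page again: everything after `#` is a URL fragment, which is handled on the client side and is not part of the HTTP request. The browser presumably uses it in JavaScript to load the next page, but the server sees the same request either way. A minimal sketch using only the standard library illustrates this; `request_target` is a hypothetical helper and the query string is shortened for readability.

```python
from urllib.parse import urldefrag, urlsplit

# Shortened example URL; the real search URL carries more parameters.
BASE = "http://e2e.ti.com/search?q=beaglebone&category=forum"


def request_target(url):
    """Return the path and query that would be sent in the HTTP request line.

    The fragment (the part after '#') is deliberately left out, because
    HTTP clients do not transmit it to the server.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


page1 = BASE
page2 = BASE + "#serp=2"

print(urldefrag(page2).fragment)   # the client-side part: serp=2
print(request_target(page1))
print(request_target(page2))
print(request_target(page1) == request_target(page2))
```

Since both URLs resolve to the same request, following the `next` link from the returned HTML, as the answer does, is what actually produces a different page.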