Web scraping multiple pages of a forum with SERP page content

Date: 2018-08-14 22:35:37

Tags: python web-scraping beautifulsoup

I have been struggling with how to get a list of links from multiple pages of a forum whose results are served as SERP page content. My code works well (my goal is to dump every thread in the search results to PDF), but it fails after the first page of a thread. When I do a quick page-source comparison of the two URLs, I can see the problem: the second URL appends "#serp=2" and loads correctly in a browser, but its page source contains the same links as the first page.
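The "#serp=2" suffix explains the symptom: everything after "#" in a URL is a fragment, which the browser keeps client-side (the site's JavaScript uses it to render page 2) and which HTTP clients never transmit, so the server returns the first page both times. A quick check with the standard library:

```python
from urllib.parse import urldefrag

url = 'http://e2e.ti.com/search?q=test&category=forum#serp=2'
base, fragment = urldefrag(url)

# Only `base` is ever sent to the server; it carries no page number.
print(base)      # http://e2e.ti.com/search?q=test&category=forum
print(fragment)  # serp=2
```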

My code is below. Any suggestions on how to pull results from the subsequent pages, or is there a way to pull all the results at once?

#! python3
# getE2EResults.py - Opens all E2E threads and saves them to a file.

import requests, sys, webbrowser, bs4, pdfkit
from pypac import PACSession 
session = PACSession()
path_wkthmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
site_list = []

print('Searching...') # display text while downloading 
res = session.get('http://e2e.ti.com/search?q=' + ''.join(sys.argv[1:]) + '&category=forum&date=&customdaterange=0&startdate=&enddate=')
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text,'lxml')

# Find the number of pages in search results
mydivs = soup.findAll("div", {"class": "search-view-by-sort"})
string1 = mydivs[0].text
numberOfResults = [int(s) for s in string1.split() if s.isdigit()]
numberOfPages = (numberOfResults[0]//10)
if (numberOfResults[0]%10 > 0):
    numberOfPages += 1
print(str(numberOfPages) + ' pages of results')
###########################################

# Find all 10 post links for the first page, add to site list
linkElems = soup.select('.name a')
numOpen = min(10, len(linkElems))
for i in range(numOpen):
    res1 = session.get(linkElems[i].get('href'))
    res1.raise_for_status()
    site_list.append(linkElems[i].get('href'))
#   soup1 = bs4.BeautifulSoup(res1.text)
#   webbrowser.open(linkElems[i].get('href'))

# Repeat for all pages in search results
if (numberOfPages > 1):
    for n in range(2,(numberOfPages+1)):
        res = session.get('http://e2e.ti.com/search?q=' + ''.join(sys.argv[1:]) + '&category=forum&date=&customdaterange=0&startdate=&enddate=#serp='+str(n))
        #print('http://e2e.ti.com/search?q=' + ''.join(sys.argv[1:]) + '&category=forum&date=&customdaterange=0&startdate=&enddate=#serp='+str(n))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text,'lxml')
        linkElems = soup.select('.name a')
        numOpen = min(10, len(linkElems))
        for i in range(numOpen):
            res1 = session.get(linkElems[i].get('href'))
            res1.raise_for_status()
            site_list.append(linkElems[i].get('href'))

counter = 1
for item in site_list:
    print(str(counter) + ' ' + item)
    counter += 1

'''         
# Create pdf of all Results
#print(site_list)
counter = 1
for item in site_list: 
  pdfkit.from_url(item, 'out'+str(counter)+'.pdf', configuration=config)
  counter += 1
#pdfkit.from_url(site_list, ''.join(sys.argv[1:])+'.pdf', configuration=config)
'''
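As an aside, the page-count arithmetic in the code above (floor division plus a remainder check) is just a ceiling division and can be written in one line; the result count here is an assumed example value:

```python
import math

number_of_results = 3073  # assumed example count parsed from the results header
number_of_pages = math.ceil(number_of_results / 10)
print(number_of_pages)  # 308

# Equivalent without the math module, via negated floor division:
assert number_of_pages == -(-number_of_results // 10)
```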

1 Answer:

Answer 0 (score: 0)

The simplest approach is to look for the next-page URL and use it for the next request. If that button is missing, you have reached the last page:

from bs4 import BeautifulSoup
import requests

def get_page_urls(html):
    soup = BeautifulSoup(html, 'lxml')

    # Find the number of pages in search results
    number_of_pages = int(soup.find(class_='search-view-by-sort').span.text.split(' ')[2].replace(',', '')) // 10

    # Find the URL for the next page
    next_url = soup.find('a', class_='next')

    if next_url:    
        next_url = base_url + next_url['href']

    # Display/store all of the links
    for link in soup.select('.name a'):
        site_list.append(link['href'])
        print(' ', link['href'])

    return number_of_pages, next_url


site_list = []
page_number = 1
jar = requests.cookies.RequestsCookieJar()
base_url = 'http://e2e.ti.com'
search = 'Beaglebone black'
url = '{}/search?q={}&category=forum&date=&customdaterange=0&startdate=&enddate='.format(base_url, search)

print("Page 1")
res = requests.get(url, cookies=jar)
number_of_pages, url = get_page_urls(res.text)    

while url:    
    page_number += 1
    print("Page {} of {}".format(page_number, number_of_pages))
    res = requests.get(url, cookies=jar)
    _, url = get_page_urls(res.text)    

This code keeps requesting pages and storing the URLs until all the pages have been received. Note that the search term is hard-coded for testing.
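For a non-hard-coded search, the query should be URL-encoded rather than concatenated raw (the question's `''.join(sys.argv[1:])` also silently drops the spaces between words). A sketch using the standard library, where `build_search_url` is a hypothetical helper name:

```python
from urllib.parse import quote_plus

base_url = 'http://e2e.ti.com'

def build_search_url(query):
    # quote_plus encodes spaces as '+' and escapes reserved characters
    return ('{}/search?q={}&category=forum&date=&customdaterange=0'
            '&startdate=&enddate='.format(base_url, quote_plus(query)))

print(build_search_url('Beaglebone black'))
```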

This will give you results like the following:

Page 1
  http://e2e.ti.com/support/arm/sitara_arm/f/791/t/270719?tisearch=e2e-sitesearch&keymatch=Beaglebone black
  http://e2e.ti.com/support/embedded/linux/f/354/t/483988?tisearch=e2e-sitesearch&keymatch=Beaglebone black
  ..
  ..
Page 2 of 308
  http://e2e.ti.com/support/embedded/starterware/f/790/t/301790?tisearch=e2e-sitesearch&keymatch=Beaglebone black
  http://e2e.ti.com/support/arm/sitara_arm/f/791/t/501015?tisearch=e2e-sitesearch&keymatch=Beaglebone black
  ..
  ..
Page 3 of 308
  http://e2e.ti.com/support/embedded/starterware/f/790/p/285959/1050634?tisearch=e2e-sitesearch&keymatch=Beaglebone black#1050634
  ..
  ..
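To finish the original goal of dumping each thread to PDF, the collected `site_list` can be fed back into pdfkit as in the question's commented-out block. A minimal sketch of the numbering scheme, with the actual `pdfkit.from_url` call left commented out since it requires wkhtmltopdf to be installed (`pdf_names` is a hypothetical helper):

```python
def pdf_names(urls):
    # One numbered output file per collected thread URL, matching the question's scheme.
    return ['out{}.pdf'.format(i) for i, _ in enumerate(urls, 1)]

site_list = ['http://e2e.ti.com/support/arm/sitara_arm/f/791/t/270719']
for url, name in zip(site_list, pdf_names(site_list)):
    # pdfkit.from_url(url, name, configuration=config)  # needs wkhtmltopdf installed
    print(name, '<-', url)
```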