How do I get links using Selenium and scrape the pages with BeautifulSoup?

Asked: 2019-06-14 02:39:03

Tags: selenium-webdriver web-scraping beautifulsoup

I want to collect articles from a particular website. Earlier I used only BeautifulSoup, but it did not pick up the links, so I tried Selenium instead. I have now written the code below, but it prints "None". I have never used Selenium before, so I don't know much about it. What should I change in this code so that it works and gives the desired result?

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get(url)

link = browser.find_elements_by_class_name('gs-title')
for links in link:
    links.get_attribute('href')
    soup = BeautifulSoup(browser.page_source, 'lxml')
    date = soup.find('span', {'class': 'post-date'})
    title = soup.find('h1', {'class':'headline'})
    content = soup.find('div',{'class':'article-body'})
    print(date)
    print(title)
    print(content)

    time.sleep(3)
browser.close()

I want to collect the date, title, and content of all the articles on this page, as well as on the other pages from page 7 to page 18.

Thank you.

2 Answers:

Answer 0 (score: 1)

Instead of using Selenium to get the anchors, I tried extracting the page source first with Selenium's help and then using Beautiful Soup on it.

So, to put it all together:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
#wait = WebDriverWait(browser, 10) #Not actually required
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser') #Get the Page Source
anchors = soup.find_all("a", class_ = "gs-title") #Now find the anchors

for anchor in anchors:
    browser.get(anchor['href']) #Connect to the news link and extract its page source
    sub_soup = BeautifulSoup(browser.page_source, 'html.parser')
    date = sub_soup.find('span', {'class': 'post-date'})
    title = sub_soup.find('h1', {'class':'post-title'}) #Note that the class attribute for the heading is 'post-title' and not 'headline'
    content = sub_soup.find('div',{'class':'article-body'})
    print([date.string, title.string, content.get_text()]) #.string returns None on tags with nested children, so use get_text() for the article body

    #time.sleep(3) #Even this I don't believe is required
browser.close()

With this modification, I believe you will be able to get the content you need.
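
If you also need the other result pages (7 to 18), one option is to loop over the gsc.page value in the URL fragment and collect the links from each page before visiting them. The sketch below is my own assumption of how that could look, not part of the original answer: it adds a WebDriverWait because the results are rendered by JavaScript, and the exact wait condition and the href=True filter (to skip the duplicate anchors the search widget renders) are guesses that may need adjusting.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)

article_links = []
for page in range(7, 19):  #pages 7 to 18 from the question
    #The search results live behind the '#gsc.page=N' fragment
    browser.get('https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity'
                '&gsc.sort=date&gsc.page={}'.format(page))
    #The results are rendered by JavaScript, so wait until at least
    #one result anchor is present before reading the page source
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'gs-title')))
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    #href=True skips anchors without a link target
    for anchor in soup.find_all('a', class_='gs-title', href=True):
        article_links.append(anchor['href'])

browser.close()
print(len(article_links), 'links collected')

Depending on the driver, a navigation that only changes the #gsc.page fragment may not force a full reload, so treat this as a starting point rather than a tested implementation.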

Answer 1 (score: 0)

You can use the same API that the page itself uses. Change the parameters to get all the result pages.

import requests
import json
import re

#The endpoint returns JSONP: the JSON payload is wrapped in a
#'google.search.cse.api3732(...)' callback, so strip the wrapper
#with a regex before parsing
r = requests.get('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc&gss=.uk&start=60&cselibv=5d7bf4891789cfae&cx=012545676297898659090:wk87ya_pczq&q=cybersecurity&safe=off&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340&filter=0&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732')
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(r.text)[0])
#Each result's click-tracking URL points at the article
links = [item['clicktrackUrl'] for item in data['results']]
print(links)
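
To cover pages 7 to 18, you could vary the start parameter of that request: the widget returns 10 results per page (num=10), so page 7 should correspond to start=60, the value in the URL above. The following is a minimal sketch under that assumption; note that the cse_tok value is session-bound and expires, so in practice it has to be refreshed from a live page load before this will work.

import requests
import json
import re

#Same endpoint as above, with start left as a placeholder;
#the cse_tok here is session-bound and will need refreshing
api = ('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en'
       '&source=gcsc&gss=.uk&cselibv=5d7bf4891789cfae'
       '&cx=012545676297898659090:wk87ya_pczq&q=cybersecurity&safe=off'
       '&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340&filter=0'
       '&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732'
       '&start={}')
p = re.compile(r'api3732\((.*)\);', re.DOTALL)

all_links = []
for page in range(7, 19):    #pages 7 to 18 from the question
    start = (page - 1) * 10  #10 results per page, so page 7 is start=60
    r = requests.get(api.format(start))
    data = json.loads(p.findall(r.text)[0])
    all_links.extend(item['clicktrackUrl'] for item in data['results'])

print(len(all_links), 'links collected')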