How to scrape a password-protected website

Time: 2020-03-13 11:24:14

Tags: python selenium web-scraping

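I have a website that I need to scrape some data from (the site is https://www.merriam-webster.com/, and I want to scrape my saved words).

The site is password protected, and I also think there is some JavaScript involved that I don't understand (I believe some elements are loaded by the browser, because they don't show up when I fetch the HTML directly).

I currently have a solution using Selenium, and it does work, but it requires opening Firefox. I would really like a solution that can run in the background as a console-only program.

How would I achieve this, if possible, using Python's requests library and as few additional third-party libraries as possible?

Here is the code for my Selenium solution: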

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
import json

# Create new driver
browser = webdriver.Firefox()
browser.get('https://www.merriam-webster.com/login')

# Find fields for email and password
username = browser.find_element(By.ID, "ul-email")
password = browser.find_element(By.ID, "ul-password")
# Find button to login
send = browser.find_element(By.ID, "ul-login")
# Send username and password 
username.send_keys("username")
password.send_keys("password")

# Wait for accept cookies button to appear and click it
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "accept-cookies-button"))).click()
# Click the login button
send.click()

# Find button to go to saved words
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "ul-favorites"))).click()


words = {}
# Now logged in
# Loop over pages of saved words
for i in range(2):
    print("Now on page " + str(i+1))
    # Wait for the next page button to be clickable, then keep a handle to it
    nextpage = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "ul-page-next")))

    # Find all the words on the page
    for word in browser.find_elements(By.CLASS_NAME, "item-headword"):
        # Add the href to the dictionary, keyed by the displayed word
        words[word.get_attribute("innerHTML")] = word.get_attribute("href")
    # Navigate to the next page
    nextpage.click()

browser.close()

# Print the words list
with open("output.json", "w", encoding="utf-8") as file:
    file.write(json.dumps(words, indent=4))

1 Answer:

Answer 0 (score: 1)

If you want to use the requests module, you need to use a session.

To initialize a session, do the following:

import requests

session_requests = requests.session()

Then you need a payload with the username and password:

payload = {
    "username": "<USERNAME>",
    "password": "<PASSWORD>",
}

Then, to log in, do:

result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url),
)

Your session should now be logged in, so to go to any other password-protected page using the same session:

result = session_requests.get(
    url,
    headers=dict(referer=url),
)

You can then read the contents of that page from result.text (or result.content for the raw bytes).

Edit: if your site uses a CSRF token, you need to include it in the payload. To get the token, request the login page with the same session first, read the token value out of the returned HTML, and add it to the payload before posting.
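
The CSRF step can be sketched with only the standard library's html.parser on top of a requests session. This is a minimal sketch under assumptions, not the answer's original code: it assumes the token is delivered as a hidden form input on the login page, and the helper names (HiddenInputParser, extract_hidden_fields, login_with_csrf) and the token_field argument are hypothetical. The real field name depends on the site, so inspect the login form to find it. Some sites deliver the token through a cookie or a meta tag instead, in which case this approach would need adapting.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(page_html):
    """Return a dict of hidden form fields found in the given HTML string."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    return parser.hidden_fields


def login_with_csrf(session, login_url, username, password, token_field):
    """Fetch the login page, copy the CSRF token into the payload, then post.

    `session` is expected to be a requests session (requests.session()).
    `token_field` is the site-specific name of the hidden input (an assumption).
    """
    # Fetch the form first so the server issues a token (and any cookies).
    login_page = session.get(login_url)
    hidden = extract_hidden_fields(login_page.text)
    payload = {
        "username": username,
        "password": password,
        token_field: hidden[token_field],  # KeyError if the field is absent
    }
    return session.post(login_url, data=payload, headers=dict(referer=login_url))
```

With a real session this would be called as login_with_csrf(session_requests, login_url, "<USERNAME>", "<PASSWORD>", token_field), after which session_requests.get(url) can be used for the protected pages as before. Keeping the parsing in a separate function makes it easy to check against a saved copy of the login page without touching the network.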