Web scraping LinkedIn isn't giving me the HTML... what am I doing wrong?

Asked: 2019-04-19 13:39:27

Tags: python html selenium web-scraping beautifulsoup

So I'm trying to scrape the About pages of certain companies on LinkedIn to get their "Specialties". Scraping LinkedIn with Beautiful Soup gave me an Access Denied error, so I'm using headers to fake my browser. However, instead of the corresponding HTML, it returns the following output (the response bytes, shown here with the \n escapes decoded for readability):

window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If the "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
      "&originalReferer=" + document.referrer.substr(0, 200) +
      "&sessionRedirect=" + encodeURIComponent(window.location.href);
}

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.linkedin.com/company/biotech/'

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content)

What am I doing wrong? I think it's trying to check for cookies. Is there some way I can handle that in my code?

3 Answers:

Answer 0 (score: 2)

LinkedIn is actually performing some interesting cookie setting and subsequent redirects, which prevents your code from working as-is. This is apparent from inspecting the JavaScript returned by your initial request. Essentially, the web server sets HTTP cookies to carry tracking information, and the JavaScript you're seeing parses those cookies before the final redirect takes place. If you reverse-engineer the JavaScript, you'll find that the final redirect looks like this (at least for me, based on my location and tracking info):

url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
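As a rough illustration, here is a small Python sketch that rebuilds that authwall URL the same way the JavaScript does, assuming the default "bf" tracking values and the empty referrer seen above (the variable names are mine):

from urllib.parse import quote

# Rebuild the authwall URL the way the LinkedIn script does, assuming the
# default "bf" tracking values and an empty referrer (as in the JS above).
trk = "bf"
trkInfo = "bf"
session_redirect = quote("https://www.linkedin.com/company/biotech/", safe="")

authwall_url = ("https://www.linkedin.com/authwall?trk=" + trk
                + "&trkInfo=" + trkInfo
                + "&originalReferer="
                + "&sessionRedirect=" + session_redirect)
print(authwall_url)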

Additionally, you can use Python's requests module to maintain a session for you, which will automatically manage HTTP headers such as cookies so you don't have to worry about them. The following should give you the HTML source you're after. I'll leave it to you to implement BeautifulSoup and parse what you need.

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'


with requests.Session() as s:
    response = s.get(url)
    print(response.content)
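From there, a parse step along these lines could pick out the "Specialties" entry, continuing from the snippet above. This is only a sketch: the <dt>/<dd> structure it assumes is a guess, and LinkedIn's markup changes often, so inspect the real page and adjust the selectors.

soup = BS(response.content, 'html.parser')

# Hypothetical markup: assume the About section renders a <dt>Specialties</dt>
# label followed by a <dd> holding the value. Inspect the page to confirm.
label = soup.find('dt', string=lambda s: s and 'Specialties' in s)
if label:
    value = label.find_next_sibling('dd')
    if value:
        print(value.get_text(strip=True))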

Answer 1 (score: 0)

You can use Selenium to get pages with dynamic JS content. You will also have to log in, since the page you want to retrieve requires authentication. So:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

EMAIL = ''
PASSWORD = ''

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/company/biotech/')
# Open the sign-in form on the auth wall.
el = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'form-toggle')))
driver.execute_script("arguments[0].click();", el)
# Fill in the credentials and submit the form.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-email'))).send_keys(EMAIL)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-password'))).send_keys(PASSWORD)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'login-submit'))).click()
# Read the fourth <dd> of the About section's definition list.
text = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="ember71"]/dl/dd[4]'))).text

Output:

Distributing medical products
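One caveat: the //*[@id="ember71"] XPath leans on an auto-generated Ember id, which can change between page renders. An untested alternative, assuming the About section lays out its fields as <dt>/<dd> pairs, would be to key off the visible label instead:

# Hypothetical, sturdier locator: find the <dd> following the "Specialties"
# <dt> label rather than relying on an auto-generated Ember id.
text = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//dt[contains(., 'Specialties')]/following-sibling::dd[1]")
    )
).text
print(text)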

Answer 2 (score: -1)

You need to parse the response with Beautiful Soup first.

# page_response is the requests response obtained earlier.
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
for i in range(0, 20):
    # Loop through the first 20 <p> tags and collect their text into a
    # list so the data can be manipulated later.
    paragraphs = page_content.find_all("p")[i].text
    textContent.append(paragraphs)

Not my example, but it can be found here: https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486