BeautifulSoup: unable to scrape content

Asked: 2018-01-16 12:13:01

Tags: python web-scraping beautifulsoup

I can't extract content from this website. I've tried adding different headers, but I still can't scrape any data from it.

import requests
from bs4 import BeautifulSoup

seedURL = 'https://www.owler.com/location/new-york-companies?p=2'

# headers = requests.utils.default_headers()
# headers.update({
#     'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
# })
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}

req_content = requests.get(seedURL, headers=headers)
data = BeautifulSoup(req_content.content, "lxml")
print(data)

This is the response I get:

<!DOCTYPE html>
<html>
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="max-age=0" http-equiv="cache-control"/>
<meta content="no-cache" http-equiv="cache-control"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="no-cache" http-equiv="pragma"/>
<meta content="10; url=/distil_r_captcha.html?requestId=c4aceb58-d5b5-480d-a09f-dafd9cca7cbe&amp;httpReferrer=%2Flocation%2Fnew-york-companies%3Fp%3D2" http-equiv="refresh"/>
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script defer="" src="/owlerdstl.js" type="text/javascript"></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#dqfubwvfuxfsxffus{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
</body>
</html>

1 Answer:

Answer 0: (score: 1)

Try this. It should let you fetch the content you're after:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.owler.com/location/new-york-companies?p=2")

# wait until the JavaScript-rendered company cards are present before reading them
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".company-details"))):
    company_name = item.find_element_by_id("company-name-1").text
    ceo_name = item.find_element_by_id("ceo-name-1").text
    print(company_name, ceo_name)

driver.quit()

Partial output:

Mercer Julio A. Portalatin
Thomson Reuters James C. Smith
Bloomberg, L.P. Michael R. Bloomberg
American Express Co Kenneth I. Chenault
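
If you would still rather do the parsing with BeautifulSoup, you can hand it Selenium's rendered page source once the listings have appeared. The following is only a minimal sketch: it reuses the .company-details selector from above and guesses that the per-card ids start with company-name / ceo-name, so the [id^=...] prefix selectors are an assumption you may need to adjust to the markup you actually get.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.owler.com/location/new-york-companies?p=2")

# let the JavaScript-rendered cards load, then take the full page source
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".company-details")))
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for item in soup.select(".company-details"):
    # [id^=...] matches ids that merely start with the given prefix (assumed naming)
    name = item.select_one("[id^='company-name']")
    ceo = item.select_one("[id^='ceo-name']")
    if name and ceo:
        print(name.get_text(strip=True), ceo.get_text(strip=True))

This keeps the waiting in Selenium, where the JavaScript actually runs, while all of the parsing stays in BeautifulSoup, which is closer to your original script.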