How do I web scrape filtered results with Python requests?

Time: 2020-06-18 20:08:39

Tags: python web-scraping beautifulsoup python-requests

I am trying to scrape filtered results from this website: https://www.gurufocus.com/insider/summary. At the moment I can only get the information from the first page. What I really want to do is filter by several industries and get the related data (you can see "Industry" in the filter area). However, when I select an industry the URL does not change, so I cannot scrape directly from the URL. I have seen people say you can use requests.post to get the data, but I don't understand how that works.

Here is some of my code.

import requests
from bs4 import BeautifulSoup

TradeUrl = "https://www.gurufocus.com/insider/summary"
r = requests.get(TradeUrl)
data = r.content
soup = BeautifulSoup(data, 'html.parser')

# Collect the ticker symbols shown on the first page
ticker = []
for tk in soup.find_all('td', {'class': 'table-stock-info', 'data-column': 'Ticker'}):
    ticker.append(tk.text)

What if I only need the tickers for the Financial Services industry?

1 answer:

Answer 0: (score: 0)

The problem with the suggested post request is that the request needs an authorization token, and that token has an expiry time. If you right-click on the page -> select Inspect -> select Network, then choose an industry and click on the resulting POST request, under Cookies there is a cookie called password_grant_custom.client.expires whose timestamp indicates when the authorization can no longer be used.
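For illustration only, here is a rough sketch of what that requests.post approach might look like if you copied the token out of the browser's Network tab before it expires. The endpoint URL and the payload field names below are assumptions (placeholders), not the site's documented API:

import requests

# Assumption: copy the real URL, headers and JSON body from the POST request
# captured in DevTools -> Network. The token stops working at the time stored
# in the password_grant_custom.client.expires cookie.
url = "https://www.gurufocus.com/..."  # placeholder, not the real endpoint
headers = {
    "Authorization": "Bearer <token copied from the captured request>",  # assumption
    "Content-Type": "application/json",
}
payload = {"industry": "Financial Services", "page": 1}  # illustrative field names only

r = requests.post(url, headers=headers, json=payload, timeout=30)
r.raise_for_status()
print(r.json())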

However, you can use Selenium to scrape the data from all of the pages.

First install Selenium:

`sudo pip3 install selenium` on Linux or `pip install selenium` on Windows

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads, choose the one that matches your version of Chrome, and extract it from the zip file.

Note: on Windows you will need to add the path to chromedriver when you create the driver in

driver = webdriver.Chrome(options=options)

On Linux, copy chromedriver to

/usr/local/bin/chromedriver
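For example, a minimal sketch of pointing Selenium at an explicit chromedriver location on Windows (Selenium 3 style syntax, matching the rest of this answer; the path is an assumption, adjust it to wherever you extracted the driver):

# assumption: replace with the actual location of chromedriver.exe
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe", options=options)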


from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time

# Start with the driver maximised to see the drop down menus properly
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options)
driver.get('https://www.gurufocus.com/insider/summary')

# Set the page size to 100 to reduce page loads
driver.find_element_by_xpath("//span[contains(text(),'40 / Page')]").click()
wait = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//div[contains(text(),'100')]"))
)
element = driver.find_element_by_xpath("//div[contains(text(),'100')]").click()

# Wait for the page to load and don't overload the server
time.sleep(2)

# select Industry
driver.find_element_by_xpath("//span[contains(text(),'Industry')]").click()

# Select Financial Services
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//span[contains(text(),'Financial Services')]"))
)
element.click()

ticker = []

while True:
    # Wait for the page to load and don't overload the server
    time.sleep(6)
    # Parse the HTML
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for tk in soup.find_all('td', {'class': 'table-stock-info', 'data-column': 'Ticker'}):
        ticker.append(tk.text)
    try:
        # Move to the next page
        element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
        element.click()
    except TimeoutException as ex:
        # No more pages so break
        break
driver.quit()

print(len(ticker))
print(ticker)

Output

4604
['PUB   ', 'ARES   ', 'EIM   ', 'CZNC   ', 'SSB   ', 'CNA   ', 'TURN   ', 'FNF   ', 'EGIF   ', 'NWPP  etc...

UPDATED

If you want to scrape all the data from all pages and/or write it to a csv, use pandas:

from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
import pandas as pd
import time

# Start with the driver maximised to see the drop down menus properly
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options)
driver.get('https://www.gurufocus.com/insider/summary')

# Set the page size to 100 to reduce page loads
driver.find_element_by_xpath("//span[contains(text(),'40 / Page')]").click()
wait = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//div[contains(text(),'100')]"))
)
driver.find_element_by_xpath("//div[contains(text(),'100')]").click()

# Wait for the page to load and don't overload the server
time.sleep(2)

# select Industry
driver.find_element_by_xpath("//span[contains(text(),'Industry')]").click()

# Select Financial Services
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//span[contains(text(),'Financial Services')]"))
)
element.click()

columns = [
    'Ticker', 'Links', 'Company', 'Price1', 'Insider Name', 'Insider Position',
    'Date', 'Buy/Sell', 'Insider Trading Shares', 'Shares Change', 'Price2',
    'Cost(000)', 'Final Share', 'Price Change Since Insider Trade (%)',
    'Dividend Yield %', 'PE Ratio', 'Market Cap ($M)', 'None'
]
df = pd.DataFrame(columns=columns)

while True:
    # Wait for the page to load and don't overload the server
    time.sleep(6)
    # Parse the HTML and append this page's table to the dataframe
    df = df.append(pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0], ignore_index=True)
    try:
        # Move to the next page
        element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
        element.click()
    except TimeoutException as ex:
        # No more pages so break
        break
driver.quit()

# Write to csv
df.to_csv("Financial_Services.csv", encoding='utf-8', index=False)
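One caveat if you run this on a current pandas release: DataFrame.append was removed in pandas 2.0, so the accumulation step above needs the pd.concat equivalent, roughly:

# pandas >= 2.0: collect each page's table in a list, concatenate once at the end
frames = []
# inside the while loop, instead of df = df.append(...):
frames.append(pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0])
# after the loop:
df = pd.concat(frames, ignore_index=True)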

Updated in response to a comment:

First download the Firefox driver geckodriver from https://github.com/mozilla/geckodriver/releases and extract the driver. Again, on Windows you will need to add the path to geckodriver when creating the driver; on Linux copy geckodriver to /usr/local/bin/geckodriver. Then replace

driver = webdriver.Chrome(options=options)

with

driver = webdriver.Firefox()
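On Windows, a minimal sketch of passing an explicit geckodriver location (again Selenium 3 style syntax; the path is an assumption):

# assumption: adjust to wherever you extracted geckodriver.exe
driver = webdriver.Firefox(executable_path=r"C:\path\to\geckodriver.exe")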