Question

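I am trying to collect some data programmatically for 6,000 stocks, using Python 3.6 with selenium webdriver and Firefox. [I intended to use BeautifulSoup to parse the HTML, but the URL does not change when the page updates, and soup cannot handle JavaScript.]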

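Anyway, when I run this in a for loop, one particular line, share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)"), fails most of the time (it did work a few times, so I believe my code is fine). However, if I do it manually (copying and pasting into Python IDLE and running it), it works. I tried time.sleep(4) to let the page load before retrieving anything from it, but that does not seem to be the solution. I am out of ideas now. Can anyone help me figure this out?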

Here is my code:

from selenium import webdriver
import time
import pyautogui
import csv

filename = "historical_price_marketcap.csv"
f = open(filename, "w")
headers = "stock_ticker, share_price, market_cap\n"
f.write(headers)
driver = webdriver.Firefox()

def get_web():
    driver.get("https://stockrow.com")

with open("TICKER.csv") as file:
    read = csv.reader(file)
    TICKER = []
    for row in read:
        ticker = row[0][1:-1]
        TICKER.append(ticker)
for Ticker in range(len(TICKER)):
    get_web()
    time.sleep(3)
    pyautogui.click(425, 337)
    pyautogui.typewrite(TICKER[Ticker],0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(268, 337)
    pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite('Stock Price',0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(702, 427)
    for i in range(int(10)):
            pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-01",0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(882, 425)
    for k in range(10):
            pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-31",0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(1317, 318)
    for j in range(3):
            pyautogui.press("down")

    time.sleep(10)
    share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)")
    get_web()
    time.sleep(3)
    pyautogui.click(425, 337)
    pyautogui.typewrite(TICKER[Ticker],0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)
    pyautogui.click(268, 337)
    pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite('Market Cap',0.25)
    time.sleep(2)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(702, 427)
    for i in range(int(10)):
            pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-01",0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(882, 425)
    for k in range(10):
            pyautogui.press("backspace")
    time.sleep(2)
    pyautogui.typewrite("2013-12-31",0.25)
    pyautogui.press("enter")
    time.sleep(2)

    pyautogui.click(1317, 318)
    for j in range(3):
            pyautogui.press("down")

    time.sleep(10)
    market_cap = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(28) > text:nth-child(2)")
f.close()

The line that seems to be at fault is share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)"). This is the error message from Python:

 Traceback (most recent call last):
  File "C:\Users\HENGBIN\Desktop\get_historical_data.py", line 65, in <module>
    share_price = driver.find_element_by_css_selector(".highcharts-root > g:nth-child(25) > text:nth-child(2)")
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 457, in find_element_by_css_selector
    return self.find_element(by=By.CSS_SELECTOR, value=css_selector)
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 791, in find_element
    'value': value})['value']
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "E:\Program Files\python3.6.1\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .highcharts-root > g:nth-child(25) > text:nth-child(2)

It fails most of the time inside the loop, but it works if I run it manually in Python IDLE. I don't know what is going on.

Answer 1

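There are several things in your script that I would do differently. First, try to get rid of pyautogui. Selenium has built-in functions for clicking (see this SO question) and for sending all kinds of keys (see this SO question). Also, when you change things in the browser with pyautogui, my experience is that selenium is not always aware of those changes. That could explain why you have trouble finding elements created through pyautogui when you search for them with selenium.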
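As a rough sketch of what replacing the pyautogui clicks and keystrokes with selenium calls might look like: the helper below finds an input by CSS selector, clicks it, clears it, types text and presses Enter. The helper name type_and_submit and the selector in the usage comment are hypothetical; the real selectors have to be read from the page. The two constants are written as plain strings so the sketch has no import-time dependency; in a real script the usual spelling is By.CSS_SELECTOR and Keys.ENTER from selenium.

```python
# Plain-string equivalents of selenium's constants (assumption: written out
# here only to keep the sketch dependency-free).
CSS = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR
ENTER = "\ue007"      # same value as selenium.webdriver.common.keys.Keys.ENTER


def type_and_submit(driver, selector, text):
    """Find an input by CSS selector, replace its content and press Enter."""
    field = driver.find_element(CSS, selector)
    field.click()           # focus the field (instead of pyautogui.click)
    field.clear()           # remove existing text (instead of repeated backspace)
    field.send_keys(text)   # type the text (instead of pyautogui.typewrite)
    field.send_keys(ENTER)  # confirm (instead of pyautogui.press("enter"))
    return field


# Usage with a real browser; "input.ticker-search" is a hypothetical selector:
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("https://stockrow.com")
# type_and_submit(driver, "input.ticker-search", "AAPL")
```

Because every action goes through the driver, selenium itself performs the interaction, and nothing depends on screen coordinates or on the browser window having focus.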

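Second, your get_web() function may be causing problems. In general, whatever is inside a function has to be returned, or declared global, to be accessible outside the function. The driver that opens the web page is global (you instantiate it outside the function), but the url inside the function is local, which means you may not be able to access the content outside the function. I suggest you remove the function (it does nothing except open the URL) and simply replace the function calls in your code, like this: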

for Ticker in range(len(TICKER)):
    driver.get("https://stockrow.com")
    time.sleep(3)
    # insert keys, click and so on...

This should allow you to use selenium's driver.find_elements... methods.
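To make the scoping rule mentioned above concrete, here is a minimal sketch. FakeDriver is a hypothetical stand-in for the real webdriver that only remembers the last URL it was sent to, so the example runs without a browser. It shows which names survive a function call and which do not.

```python
class FakeDriver:
    """Hypothetical stand-in for a webdriver; it only records the last URL."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url


driver = FakeDriver()  # module-level (global) object


def get_web():
    url = "https://stockrow.com"  # local name, discarded when the function returns
    driver.get(url)               # method call on the global driver object


get_web()
print(driver.current_url)  # state stored on the global object is still visible

try:
    print(url)  # the local name itself is not visible out here
except NameError:
    print("url is not defined outside get_web()")
```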

Third, I assume you also want to extract some data from the site. If so, use something other than selenium for the parsing. Selenium is a slow parser. You could try BeautifulSoup.

Once the page has loaded, you can load the HTML into BeautifulSoup and extract whatever you want (the SO question here shows how to do this):

from bs4 import BeautifulSoup
.....
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
element_you_want_to_retrieve = soup.find('tag_name', attrs={'key': 'value'})

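But for this site, what you really should do is hit the API calls that the site itself makes. Use Chrome's inspect tool. You will see that it queries three APIs, which you can call directly and avoid the whole selenium business.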

For Apple, the URL looks like this:

url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=AAPL'

So, with the requests library, you can retrieve the content as JSON like this:

import requests
from pprint import pprint
url = 'https://stockrow.com/api/fundamentals.json?indicators[]=0&tickers[]=AAPL'
response = requests.get(url).json()
pprint(response)

This is a much faster solution than selenium.
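Since the question is about thousands of tickers, the single request above could be extended into a loop along these lines. This is only a sketch under several assumptions: the endpoint and the indicators[]=0 parameter are taken from the URL quoted above without knowing what the indicator value means, the helper names build_url, fetch_json and collect are hypothetical, and the layout of the returned JSON is not assumed at all (it is stored as-is). It uses the standard library's urllib in place of requests so that it has no third-party dependency; requests.get(url).json() would serve the same purpose inside fetch_json.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the URL quoted in the answer.
API_BASE = "https://stockrow.com/api/fundamentals.json"


def build_url(ticker, indicator="0"):
    """Build the query URL for one ticker, keeping the literal [] in the keys."""
    query = urlencode(
        [("indicators[]", indicator), ("tickers[]", ticker)], safe="[]"
    )
    return API_BASE + "?" + query


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def collect(tickers, fetch=fetch_json, pause=1.0):
    """Return {ticker: decoded JSON}, recording failures instead of stopping."""
    results = {}
    for ticker in tickers:
        try:
            results[ticker] = fetch(build_url(ticker))
        except (OSError, ValueError) as error:
            # Network errors are OSError subclasses; bad JSON raises ValueError.
            results[ticker] = {"error": str(error)}
        time.sleep(pause)  # pause between requests to go easy on the server
    return results
```

Calling collect(TICKER) with the list read from TICKER.csv would return one entry per ticker. Passing the fetch function as a parameter keeps the loop easy to try out with a stub before sending thousands of real requests.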

Python web scraping fails in a loop but works when I do it manually