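I'm trying to scrape the table at http://tickertrak.com/, but I can't get it to work. The code doesn't read anything after the table tag, and I can't even see the table's contents, so I'm quite confused. I'm new to web scraping and so far have only managed to scrape Wikipedia tables.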
import time
!pip install selenium
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("http://tickertrak.com/")
time.sleep(2)
df = pd.read_html(driver.page_source, flavor="bs4")
df = pd.concat(df)
df.drop(index=0, axis=0, inplace=True)
df.to_csv("your_table.csv", index=False)
I get an error on this line:
driver = webdriver.Chrome(options=options)
It raises a WebDriverException. Did I forget to set a path somewhere?
Answer 0 (score: 0)
The table is generated dynamically by JS (JavaScript), so you won't get the data with plain requests and bs4.
However, you can try selenium and combine it with pandas.
Here's how:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("http://tickertrak.com/")
time.sleep(2)
df = pd.read_html(driver.page_source, flavor="bs4")
df = pd.concat(df)
df.drop(index=0, axis=0, inplace=True)
df.to_csv("your_table.csv", index=False)
This writes the table to a .csv file (your_table.csv).
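As a variation on the code above, the fixed time.sleep(2) could be swapped for an explicit wait, so the page source is only read once the JavaScript has filled in the table. This is a minimal sketch, not part of the original answer: the function name fetch_tables, the 10-second timeout, and the CSS selector "table tr td" are assumptions about how the page renders. It also sets headless mode through add_argument and wraps the HTML in StringIO, on the understanding that recent Selenium and pandas releases prefer those forms.

```python
from io import StringIO

import pandas as pd


def fetch_tables(url, timeout=10):
    """Load url in headless Chrome, wait for table cells, return parsed DataFrames.

    fetch_tables is a hypothetical helper; the selector below is an assumption
    about the page's markup, not something confirmed by the site.
    """
    # Imported here so the function can be defined without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one table cell exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tr td"))
        )
        html = driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
    return pd.read_html(StringIO(html), flavor="bs4")
```

Calling fetch_tables("http://tickertrak.com/") would return a list of DataFrames that can be passed to pd.concat exactly as in the answer's code.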
Answer 1 (score: 0)
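This is one of the faster ways to get the data from that page using the requests module, because the data is already present in the page source inside a script tag. All that remains is to clean the data before storing it in a dataframe.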
import re
import requests
URL = 'http://tickertrak.com/'
with requests.Session() as s:
    # Send a browser-like User-Agent with the request
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(URL)
# The rows sit in a JavaScript array literal assigned to arrayFromPHP
items = re.findall(r"var arrayFromPHP = \[(.*?)\];", r.text)[0]
# Each inner [...] group is one table row
trs = re.findall(r"\[(.*?)\]", items)
for tds in trs:
    print(tds)
The output looks like this:
"Options","gamestop corp","gme","1","58662","131","-80","-85","1"
"Options","amc entertainment holdings inc","AMC","1","16290","36","-79","-66","2"
"Options","nokia corp","nok","1","3568","14","-86","-88","3"
"Options","regal-beloit corp","RBC","1","3254","11","-56","-89","4"
"Options","blackberry ltd","BB","1","3002","10","-91","-92","5"