I am trying to scrape data from a stock website, but the problem is that the table's contents are hidden. The site is http://www.moneycontrol.com/stocks/histstock.php
1. Select Index
2. Select S&P BSE MIDCAP
3. Filter data from Jan 2019 to Jan 2020 to get to the final page
4. I want to scrape the table contents of this page
This is what I tried with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
link='http://www.moneycontrol.com/stocks/hist_index_result.php?indian_indices=25'
html=requests.get(link)
html.status_code #200
raw=html.content
soup=BeautifulSoup(raw,'html.parser') #have tried with xml and html5lib
soup.find_all('table',{'class':'tblchart'})
#output
[<table border="0" cellpadding="0" cellspacing="0" class="tblchart">
</table>]
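I also tried Selenium, but the result was the same.
I am having a hard time getting the information.
Any suggestion, answer, or nudge in the right direction would be appreciated.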
Answer 0 (score: 1)
A solution using only BeautifulSoup. The data is loaded dynamically via Ajax, but you can simulate that request using just the requests module:
import requests
from bs4 import BeautifulSoup
data = {
'mth_frm_mth':'01',
'mth_frm_yr':'2019',
'mth_to_mth':'01',
'mth_to_yr':'2020',
'hdn':'monthly'
}
url = 'https://www.moneycontrol.com/stocks/hist_index_result.php?indian_indices=26'
soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
all_data = []
# keep only rows that contain <td> cells (this skips the header row)
for tr in soup.select('.tblchart tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    all_data.append(tds)
# print on screen
print('{:<15}{:<15}{:<15}{:<15}{:<15}'.format('Date', 'Open', 'High', 'Low', 'Close'))
for row in all_data:
    print('{:<15}{:<15}{:<15}{:<15}{:<15}'.format(*row))
Prints:
Date           Open           High           Low            Close
Jan 2020       13720.24       14946.21       13686.28       14667.96
Dec 2019       13584.07       13716.74       13103.54       13699.37
Nov 2019       13598.71       13729.32       13310.46       13560.57
Oct 2019       13190.78       13583.13       12669.63       13558.05
Sep 2019       12536.96       13648.30       12321.25       13170.76
Aug 2019       12698.94       12755.07       11950.86       12534.70
July 2019      14275.76       14375.47       12492.30       12692.18
June 2019      14882.18       15022.09       13803.07       14239.33
May 2019       14653.64       15039.53       13693.41       14867.04
Apr 2019       15069.13       15229.85       14585.92       14624.56
Mar 2019       13719.93       15034.53       13719.80       15027.36
Feb 2019       13961.93       14064.51       13099.46       13689.84
Jan 2019       14724.03       14790.99       13652.03       13926.22
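The rows collected in all_data are lists of strings. If you want dates and numbers instead, here is a minimal sketch using only the standard library. It assumes the five-column layout shown in the output above; the helper name parse_row and the MONTHS lookup are hypothetical, not part of the answer's code. Note that the site mixes abbreviated and full month names (Jan, July), so the sketch matches on the first three letters rather than relying on a single strptime format.

```python
from datetime import date

# Month numbers keyed by the first three letters of the month name (lowercase).
MONTHS = {name: number for number, name in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], start=1)}


def parse_row(row):
    """Convert ['Jan 2020', '13720.24', ...] to (date, open, high, low, close).

    Hypothetical helper: assumes the first cell is '<month name> <year>'
    and the remaining cells are prices.
    """
    label, *prices = row
    month_name, year = label.split()
    month = MONTHS[month_name[:3].lower()]
    # Strip thousands separators in case the site ever emits them (assumption).
    values = [float(p.replace(',', '')) for p in prices]
    return (date(int(year), month, 1), *values)


# Two rows copied from the printed output above.
sample = [
    ['Jan 2020', '13720.24', '14946.21', '13686.28', '14667.96'],
    ['July 2019', '14275.76', '14375.47', '12492.30', '12692.18'],
]
parsed = [parse_row(r) for r in sample]
print(parsed[1][0].isoformat(), parsed[1][4])
```

In the answer's script you would call parse_row on each entry of all_data instead of on the sample list.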
Answer 1 (score: 0)
OK, I actually solved this with Selenium. I had to update the selenium package, and after that it worked like a charm.
Here is how I did it:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

link = 'http://www.moneycontrol.com/stocks/histstock.php'
driver = webdriver.Chrome()
driver.get(link)
# find_element(By.XPATH, ...) is the Selenium 4 form of find_element_by_xpath
# step 1: select the Index tab
driver.find_element(By.XPATH, '//*[@id="wutabs2"]').click()
# step 2: select the index from the dropdown
drop = driver.find_element(By.XPATH, '//*[@id="indian_indices"]')
drop.click()
drop.send_keys('S&P BSE MIDCAP')
# step 3: set the date filter (this select receives the year 2019)
month = driver.find_element(By.XPATH, '/html/body/div[3]/div[3]/div/div[7]/div[2]/div[6]/table/tbody/tr/td[3]/form/div[2]/select[2]')
month.click()
month.send_keys('2019')
# click on search
driver.find_element(By.XPATH, '/html/body/div[3]/div[3]/div/div[7]/div[2]/div[6]/table/tbody/tr/td[3]/form/div[4]/input[1]').click()
# get the table text and split it into one string per row
for i in driver.find_elements(By.CSS_SELECTOR, 'table.tblchart'):
    a = i.text
    a = a.split('\n')
# store the lines as a DataFrame (one string per row, in column 0)
df = pd.DataFrame(a)
# remove the first row, as it contains the table headers
df.drop(index=0, inplace=True)
# split each line on spaces once and store the pieces in separate columns
parts = df[0].str.split(' ', expand=True)
df['Month'] = parts[0]
df['Year'] = parts[1]
df['Open'] = parts[2]
df['High'] = parts[3]
df['Low'] = parts[4]
df['Close'] = parts[5]
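After this step every column still holds strings. A minimal sketch of building the same six columns directly from the split lines and then converting the prices to numbers follows. The sample lines are copied from the output printed earlier in this thread and stand in for the list a; the names lines, columns and price_cols are hypothetical.

```python
import pandas as pd

# Sample lines in the shape produced by splitting the table text on '\n'
# (header row already removed).
lines = [
    'Jan 2020 13720.24 14946.21 13686.28 14667.96',
    'Dec 2019 13584.07 13716.74 13103.54 13699.37',
]

columns = ['Month', 'Year', 'Open', 'High', 'Low', 'Close']
# Split each line on whitespace once and build the frame in one step.
df = pd.DataFrame([line.split() for line in lines], columns=columns)

# Convert the price columns from strings to floats.
price_cols = ['Open', 'High', 'Low', 'Close']
df[price_cols] = df[price_cols].apply(pd.to_numeric)

print(df.dtypes)
```

Splitting before constructing the DataFrame avoids the intermediate single-column frame, and numeric columns make later steps such as sorting or computing returns behave as expected.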