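I have an assignment in which I have to scrape certain parts of a web page. The page I was given is the BSE India home page, and I have to scrape the "trending companies in news and social media" section.
Each company links to a popup window that contains a graph and other content. I need that graph and the right-hand column, which lists the company's top 10 tweets from the past 30 days. The link shows the popup for one of the companies.
Because all of this data is loaded dynamically, it does not appear in the page's HTML source. The site is https://www.bseindia.com/, and the section is on the left side of the second part of the page. How should I go about this?
I tried looping over the company rows and clicking each one to open its link, but after that I don't understand how to get the data. While inspecting the page, I found that the popup lives in an iframe different from the first one, so I tried switching to it, but that raised an exception saying the frame was not found. I also don't understand how to return to the previous frame after closing the popup.
The following code gets all the company names and percentages. I need to go into each company's popup and extract all the information described above.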
from selenium import webdriver
import time
import pandas as pd
from selenium.webdriver.common.keys import Keys

co = []
percentage = []
files = []

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/")
time.sleep(10)  # wait 10 seconds for the DOM to load completely

iframe = mydriver.find_element_by_xpath("//iframe[@class = 'sentifi-widget-frame']")  # locate the widget iframe
mydriver.switch_to.frame(iframe)  # switch to the iframe

# Collect the name and percentage of each of the 10 trending companies
for count in range(1, 11):
    co_name = mydriver.find_element_by_xpath("//div[@id = 'sf-widget-wrapper']/div/div/div/div/div/div[" + str(count) + "]/span/span[@class = 'sf-topic-name-text']")
    co.append(co_name.text)
    co_percentage = mydriver.find_element_by_xpath("//div[@id = 'sf-widget-wrapper']/div/div/div/div/div/div[" + str(count) + "]/div/span/span[@class = 'sf-percent-number']")
    percentage.append(co_percentage.text)
    file = "file_" + str(count) + ".xlsx"
    files.append(file)

# The update time appears once; pad with empty strings so the column has 10 rows
t = mydriver.find_element_by_xpath("//div[@id = 'sf-widget-wrapper']/div/div/div/div/div/span[@class = 'sf-updated-time']")
t_t = [t.text]
for i in range(1, 10):
    t_t.append("")

mydriver.switch_to.default_content()
mydriver.close()
mydriver.quit()

df = pd.DataFrame.from_dict({'Company Name': co, 'Percentage': percentage, 'Files': files, 'Update': t_t})
df.to_excel('Trending Companies.xlsx', header=True, index=False)  # write the data to the Excel sheet
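The files named in the Files list should contain each company's information, i.e. the graph and the list of tweets. The linked image shows the Excel file after the first-level scrape.
Any help would be appreciated.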
Answer 0 (score: 2)
The following code will get you the links for the individual companies:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
#chrome_options.add_argument("--headless")
chrome_options.add_argument("--start-maximized")
mydriver = webdriver.Chrome(executable_path='C:/Program Files/chromedriver.exe', chrome_options=chrome_options)
mydriver.get("https://www.bseindia.com/")

# Wait for the trending-companies widget iframe, then switch into it
WebDriverWait(mydriver, 30).until(EC.visibility_of_element_located((By.CLASS_NAME, "sentifi-widget-frame")))
iframe = mydriver.find_element_by_xpath("//iframe[@class = 'sentifi-widget-frame']")  # locate iframe element
mydriver.switch_to.frame(iframe)  # switch to the iframe

for count in range(1, 11):
    selector = "#sf-widget-wrapper > div > div > div > div > div.sf-widget-content > div:nth-child(" + str(count) + ") > span.sf-topic-name > span"
    WebDriverWait(mydriver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, selector)))
    companyElement = mydriver.find_element_by_css_selector(selector)
    companyElement.click()  # opens the company's popup

    # The popup is a separate iframe in the top-level document
    mydriver.switch_to.default_content()
    newiFrame = mydriver.find_element_by_css_selector("#SF-Screen-ranking-914SF2-EN")
    mydriver.switch_to.frame(newiFrame)
    WebDriverWait(mydriver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".sf-header-anchor")))
    href = mydriver.find_element_by_css_selector(".sf-header-anchor").get_attribute("href")
    print("Company Link " + str(count) + " : " + href)

    # Close the popup, then go back into the widget iframe for the next company
    closePopUp = mydriver.find_element_by_css_selector(".icon.sfin-close")
    closePopUp.click()
    mydriver.switch_to.default_content()
    mydriver.switch_to.frame(iframe)

mydriver.switch_to.default_content()
mydriver.close()
mydriver.quit()
Sample output:
Company Link 1 : https://sentifi.com/stocks/vikas-ecotech
Company Link 2 : https://sentifi.com/stocks/nbcc-india
Company Link 3 : https://sentifi.com/stocks/interglobe-aviation
Company Link 4 : https://sentifi.com/stocks/tata-motors-ltd
Company Link 5 : https://sentifi.com/stocks/iifl-holdings
Company Link 6 : https://sentifi.com/stocks/idbi-bank-ltd
Company Link 7 : https://sentifi.com/stocks/nhpc
Company Link 8 : https://sentifi.com/stocks/hdfc-ltd
Company Link 9 : https://sentifi.com/stocks/hindustan-copper
Company Link 10 : https://sentifi.com/stocks/krbl-ltd
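On the question of how to get back to the previous frame: the loop above does it by calling `switch_to.default_content()` and then switching into the widget iframe again. A minimal sketch of how that pairing could be wrapped so the return always happens, even when a lookup inside the popup fails. The helper name `inside_frame` is hypothetical, not part of Selenium; it only assumes the driver exposes `switch_to.frame(...)` and `switch_to.default_content()` as used in the code above.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, then always return to the top-level document.

    `inside_frame` is a hypothetical helper, not a Selenium API.
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs even if a lookup inside the frame raises, so the driver
        # is never left stuck inside the popup iframe.
        driver.switch_to.default_content()
```

Inside the loop, `with inside_frame(mydriver, newiFrame):` could wrap the popup lookups and the close click; after the block, `mydriver.switch_to.frame(iframe)` re-enters the widget iframe for the next company.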
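Additionally:
If you look closely at the XHR requests, this one gives you the links for the individual companies: https://widgets.sentifi.com/boards?portfolioId=96330&period=lastweek&top=2&order=topicBuzzChange&language=en&eventStatisticLanguage=en&eventStatisticEnable=false
If you change the "top" parameter in the XHR URL from 2 to 10, you get the top 10 trending companies.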
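A small sketch of changing the "top" parameter with the standard library rather than editing the string by hand. The function name `with_top` is made up for illustration; the URL is the one quoted above, and nothing here sends a request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The XHR URL quoted in the answer (top=2).
XHR_URL = (
    "https://widgets.sentifi.com/boards?portfolioId=96330&period=lastweek"
    "&top=2&order=topicBuzzChange&language=en"
    "&eventStatisticLanguage=en&eventStatisticEnable=false"
)


def with_top(url, top):
    """Return `url` with its `top` query parameter set to `top`.

    `with_top` is a hypothetical helper name used only in this sketch.
    """
    parts = urlsplit(url)
    # Keep every parameter in its original order; only replace `top`.
    query = [
        (key, str(top) if key == "top" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


top10_url = with_top(XHR_URL, 10)
print(top10_url)
```

The resulting URL can then be fetched with any HTTP client; whether the endpoint still answers without extra headers or cookies is not something this sketch checks.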
Sample JSON response for the top 2 companies:
{
  "type": null,
  "data": [
    {
      "itemkey": 1798045,
      "name": "Vikas EcoTech",
      "shortName": "Vikas EcoTech",
      "listedCompany": true,
      "buzz": null,
      "channels": null,
      "avg30h": null,
      "change": 250,
      "urn": "/stocks/vikas-ecotech",
      "hasNewEvent": null,
      "isin": "INE806A01020"
    },
    {
      "itemkey": 25075,
      "name": "NBCC (India)",
      "shortName": "NBCC India",
      "listedCompany": true,
      "buzz": null,
      "channels": null,
      "avg30h": null,
      "change": 160,
      "urn": "/stocks/nbcc-india",
      "hasNewEvent": null,
      "isin": "INE095N01023"
    }
  ],
  "extra": {
    "updatedTime": "2018-06-25T12:11:15.776Z"
  },
  "error": 0,
  "message": null,
  "localization": null,
  "params": null,
  "pager": null
}
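A sketch of turning such a response into company links. It assumes the response keeps the shape shown above, and it assumes the base URL `https://sentifi.com`, which is inferred from the sample output (each printed link equals that base plus the item's `urn`). The function name `company_links` is made up for illustration, and the sample payload below is trimmed to the fields the code reads.

```python
import json

# Assumption: inferred by comparing the "urn" fields with the sample output links.
SENTIFI_BASE = "https://sentifi.com"


def company_links(payload):
    """Return (name, link) pairs for every company in a parsed boards response."""
    return [(item["name"], SENTIFI_BASE + item["urn"]) for item in payload["data"]]


# Trimmed copy of the sample response: only the fields used here.
sample = json.loads("""
{
  "data": [
    {"name": "Vikas EcoTech", "change": 250, "urn": "/stocks/vikas-ecotech"},
    {"name": "NBCC (India)", "change": 160, "urn": "/stocks/nbcc-india"}
  ],
  "extra": {"updatedTime": "2018-06-25T12:11:15.776Z"}
}
""")

links = company_links(sample)
for name, link in links:
    print(name, link)
```

Reading the links from the JSON avoids clicking through each popup, although the graph and tweet list inside the popup would still have to be collected separately.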