Second-level dynamic scraping of a web page

Time: 2018-06-25 09:09:25

Tags: python web-scraping

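I have an assignment that requires scraping certain parts of a web page. The page I was given is the BSE India home page, and I need to scrape the "Trending companies in news and social media" section.

Each company links to a pop-up window that contains a chart and other content. I need that chart and the right-hand column, which lists the company's top 10 tweets from the past 30 days. The link shows the pop up for one of the companies

Because all of this data is loaded dynamically, it does not appear in the page's HTML source. The site is https://www.bseindia.com/, and the section is on the left side of the second part of the page. How should I do this?

I tried looping over the company rows and clicking each one to open its link, but I don't understand how to get the data after that. While inspecting the page, I found that the pop-up lives in an iframe different from the first one, so I tried switching to it, but that raised an exception saying the frame was not found. I also don't understand how to return to the previous frame after closing the pop-up.

The following code gets all the company names and percentages. I need to go into each company's pop-up and collect the information described above.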

from selenium import webdriver
import time 
import pandas as pd
from selenium.webdriver.common.keys import Keys

co = []
percentage = []
files = []

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/")
time.sleep(10) # crude fixed wait: give the dynamic content 10 seconds to load
iframe = mydriver.find_element_by_xpath("//iframe[@class = 'sentifi-widget-frame']") # locate iframe element
mydriver.switch_to.frame(iframe) # switch to the iframe

for count in range(1,11):
    co_name = mydriver.find_element_by_xpath("//div[@id = 'sf-widget-wrapper']/div/div/div/div/div/div[" + str(count) + "]/span/span[@class = 'sf-topic-name-text']") 
    co.append(co_name.text)

    co_percentage = mydriver.find_element_by_xpath("//div[@id = 'sf-widget-wrapper']/div/div/div/div/div/div[" + str(count) + "]/div/span/span[@class = 'sf-percent-number']") 
    percentage.append(co_percentage.text)

    file = "file_"+str(count)+".xlsx"
    files.append(file)

t = mydriver.find_element_by_xpath("//div[@id = 'sf-widget-wrapper']/div/div/div/div/div/span[@class = 'sf-updated-time']") 
t_t = [t.text]
for i in range(1,10):
    t_t.append("")

mydriver.switch_to.default_content() # leave the iframe (switch_to_default_content() is the deprecated spelling)
mydriver.close()
mydriver.quit()

df = pd.DataFrame.from_dict({'Company Name':co, 'Percentage':percentage, 'Files': files, 'Update' : t_t})
df.to_excel('Trending Companies.xlsx', header=True, index=False) #print the data in the excel sheet. 

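The files named in the file list should contain each company's information, i.e. the chart and the list of tweets. The excel file after the 1st level scrape.

Any help would be greatly appreciated.

1 answer:

Answer 0 (score: 2)

The following code gets the links for the individual companies: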

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
#chrome_options.add_argument("--headless")
chrome_options.add_argument("--start-maximized")

mydriver = webdriver.Chrome(executable_path='C:/Program Files/chromedriver.exe' , chrome_options=chrome_options)
mydriver.get("https://www.bseindia.com/")

WebDriverWait(mydriver, 30).until(EC.visibility_of_element_located((By.CLASS_NAME, "sentifi-widget-frame")))

iframe = mydriver.find_element_by_xpath("//iframe[@class = 'sentifi-widget-frame']") # locate iframe element
mydriver.switch_to.frame(iframe) # switch to the iframe

for count in range(1,11):

    selector = "#sf-widget-wrapper > div > div > div > div > div.sf-widget-content > div:nth-child(" + str(count) + ") > span.sf-topic-name > span"
    WebDriverWait(mydriver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, selector)))
    companyElement = mydriver.find_element_by_css_selector(selector)
    companyElement.click()

    mydriver.switch_to.default_content() # back to the top-level page; the pop-up iframe lives there
    newiFrame =  mydriver.find_element_by_css_selector("#SF-Screen-ranking-914SF2-EN")
    mydriver.switch_to.frame(newiFrame)

    WebDriverWait(mydriver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".sf-header-anchor")))
    href = mydriver.find_element_by_css_selector(".sf-header-anchor").get_attribute("href")
    print("Company Link " + str(count) + " : " + href)

    closePopUp = mydriver.find_element_by_css_selector(".icon.sfin-close")
    closePopUp.click()

    mydriver.switch_to.default_content() # leave the pop-up iframe before re-entering the widget iframe
    mydriver.switch_to.frame(iframe)

mydriver.switch_to.default_content()
mydriver.close()
mydriver.quit()

Sample output:

Company Link 1 : https://sentifi.com/stocks/vikas-ecotech
Company Link 2 : https://sentifi.com/stocks/nbcc-india
Company Link 3 : https://sentifi.com/stocks/interglobe-aviation
Company Link 4 : https://sentifi.com/stocks/tata-motors-ltd
Company Link 5 : https://sentifi.com/stocks/iifl-holdings
Company Link 6 : https://sentifi.com/stocks/idbi-bank-ltd
Company Link 7 : https://sentifi.com/stocks/nhpc
Company Link 8 : https://sentifi.com/stocks/hdfc-ltd
Company Link 9 : https://sentifi.com/stocks/hindustan-copper
Company Link 10 : https://sentifi.com/stocks/krbl-ltd

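Additional:

If you look closely at the XHR requests the page makes, this one gives you the links for the individual companies: https://widgets.sentifi.com/boards?portfolioId=96330&period=lastweek&top=2&order=topicBuzzChange&language=en&eventStatisticLanguage=en&eventStatisticEnable=false

If you change the "top" parameter in that XHR URL from 2 to 10, you get the top 10 trending companies.

Sample JSON response for the top 2 companies: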
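The parameter change can be sketched in Python using only the standard library. The helper name with_top is hypothetical, and whether the endpoint still answers, or needs extra headers, is not something this sketch checks; it only rewrites the query string of the URL quoted above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# XHR URL quoted above; it asks for the top 2 companies.
BOARDS_URL = (
    "https://widgets.sentifi.com/boards?portfolioId=96330&period=lastweek"
    "&top=2&order=topicBuzzChange&language=en"
    "&eventStatisticLanguage=en&eventStatisticEnable=false"
)


def with_top(url, top):
    """Return url with its 'top' query parameter replaced (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every parameter in its original order; only 'top' changes.
    query = [
        (key, str(top) if key == "top" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


top10_url = with_top(BOARDS_URL, 10)
print(top10_url)
```

The resulting URL could then be fetched with any HTTP client, for example requests.get(top10_url).json() if the requests package is installed, which avoids driving a browser just to collect the company links.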

{
  "type": null,
  "data": [
    {
      "itemkey": 1798045,
      "name": "Vikas EcoTech",
      "shortName": "Vikas EcoTech",
      "listedCompany": true,
      "buzz": null,
      "channels": null,
      "avg30h": null,
      "change": 250,
      "urn": "/stocks/vikas-ecotech",
      "hasNewEvent": null,
      "isin": "INE806A01020"
    },
    {
      "itemkey": 25075,
      "name": "NBCC (India)",
      "shortName": "NBCC India",
      "listedCompany": true,
      "buzz": null,
      "channels": null,
      "avg30h": null,
      "change": 160,
      "urn": "/stocks/nbcc-india",
      "hasNewEvent": null,
      "isin": "INE095N01023"
    }
  ],
  "extra": {
    "updatedTime": "2018-06-25T12:11:15.776Z"
  },
  "error": 0,
  "message": null,
  "localization": null,
  "params": null,
  "pager": null
}
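As a minimal sketch, a response of this shape can be turned into company links in a few lines of Python. The base URL https://sentifi.com is an assumption inferred from the sample output above, where each printed link equals that base plus the "urn" field; the function name company_links is hypothetical, and the sample payload is trimmed to the fields the sketch reads.

```python
import json

# Assumption: links are the "urn" field appended to this base,
# as the sample output earlier in the answer suggests.
SENTIFI_BASE = "https://sentifi.com"


def company_links(payload):
    """Return (name, link, change) tuples from a boards response (hypothetical helper)."""
    return [
        (item["name"], SENTIFI_BASE + item["urn"], item["change"])
        for item in payload.get("data") or []
    ]


# Trimmed copy of the sample response shown above.
sample = json.loads("""
{
  "data": [
    {"name": "Vikas EcoTech", "change": 250, "urn": "/stocks/vikas-ecotech"},
    {"name": "NBCC (India)", "change": 160, "urn": "/stocks/nbcc-india"}
  ],
  "extra": {"updatedTime": "2018-06-25T12:11:15.776Z"},
  "error": 0
}
""")

links = company_links(sample)
for rank, (name, link, change) in enumerate(links, start=1):
    print("Company Link " + str(rank) + " : " + link)
```

The tuples can be passed straight to pandas.DataFrame to rebuild the spreadsheet from the question without the per-row XPath lookups.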