Unable to parse table data from bseindia.com with BeautifulSoup

Time: 2018-06-29 19:25:55

Tags: python csv beautifulsoup

For example, I am unable to parse the data at the following link:

https://www.bseindia.com/stock-share-price/avanti-feeds-ltd/avanti/512573/

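I want to extract the high/low table from this page. I have tried many combinations of table and div lookups, but to no avail. Below is my Python BeautifulSoup (bs4) code: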

import csv
import requests
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

f = open('bse.csv', 'w', newline = '')
writer = csv.writer(f)

with open("bselist.csv") as f:

    for row in csv.reader(f):

        for stock in row:

            url = "https://www.bseindia.com/stock-share-price/{}".format(stock)    
            soup = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")    
            mydivs = soup('div', {"class": "newscripcotent5"})[0].find_all('span')    
            writer.writerow([stock] + mydivs)
            print([stock] + mydivs)

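For simplicity, the URL above links directly to one of the records contained in bselist.csv. I am looking for the div with id "highlow".

It only gives me the following output: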

avanti-feeds-ltd/avanti/512573/

The table I am looking for is missing.

Ideally, the output should look something like this:

avanti-feeds-ltd/avanti/512573/ 52 Week High (adjusted) 999.00(13/11/2017)
avanti-feeds-ltd/avanti/512573/ 52 Week Low (adjusted)  410.26(05/06/2018)
avanti-feeds-ltd/avanti/512573/ 52 Week High (Unadjusted)   3,000.00(13/11/2017)
avanti-feeds-ltd/avanti/512573/ 52 Week Low (Unadjusted)    535.50(29/06/2018)
avanti-feeds-ltd/avanti/512573/ Month H/L   659.34/410.26
avanti-feeds-ltd/avanti/512573/ Week H/L    625.25/508.82

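1 answer:

Answer 0 (score: 0)

The information you are trying to fetch appears to be populated dynamically by JavaScript, which is probably why you cannot get it from the raw HTML. To work around this, you can use the Selenium WebDriver to load the page in a browser and read the rendered content.

Here is what the code would look like: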
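One way to confirm this diagnosis is to count the table cells inside the target div in the HTML you actually downloaded. The sketch below uses only the standard library; the two HTML snippets are hypothetical stand-ins for a raw download and a browser-rendered page, not copies of the real bseindia.com markup.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <td> cells nested inside any <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.div_stack = []  # one bool per open <div>: True if it has the target class
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append(self.css_class in classes)
        elif tag == "td" and any(self.div_stack):
            self.cells += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


def count_cells(html, css_class):
    parser = CellCounter(css_class)
    parser.feed(html)
    return parser.cells


# Hypothetical snippets: an empty placeholder vs. a table filled in by JavaScript.
static_html = '<div class="newscripcotent5"><table></table></div>'
rendered_html = ('<div class="newscripcotent5"><table><tr>'
                 '<td>Week H/L</td><td>625.25/508.82</td></tr></table></div>')

print(count_cells(static_html, "newscripcotent5"))    # 0
print(count_cells(rendered_html, "newscripcotent5"))  # 2
```

If the count is zero for the HTML returned by urllib or requests but the table is visible in the browser, the rows are most likely added after page load.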

import csv
from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium 3 style call; recent Selenium 4 releases expect a Service object instead.
# Start one browser and reuse it for every stock.
driver = webdriver.Chrome('/path/to/chromedriver')

with open("bselist.csv") as f, open('bse.csv', 'w', newline='') as output_file:
    # csv.writer quotes values that contain commas, such as "3,000.00(...)".
    writer = csv.writer(output_file)
    for row in csv.reader(f):
        for stock in row:
            url = "https://www.bseindia.com/stock-share-price/{}".format(stock)
            # Let the browser run the page's JavaScript, then parse the rendered HTML.
            driver.get(url)
            html = driver.page_source
            soup = BeautifulSoup(html, "html.parser")
            div = soup.find_all('div', {"class": "newscripcotent5"})[0]
            outer_table = div.find_all('table')[0]
            inner_table = outer_table.findChildren("table")[0]
            for tr in inner_table.findChildren("tr"):
                cols = tr.findChildren("td")
                if len(cols) < 2:
                    continue  # skip header or spacer rows
                writer.writerow([stock, cols[0].getText(), cols[1].getText()])
                print(stock + " " + cols[0].getText() + " " + cols[1].getText())

driver.quit()

Make sure to replace /path/to/chromedriver with the actual path to your chromedriver executable.

So, assuming your bselist.csv contains:
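The code above reads driver.page_source right after driver.get(url), so on a slow connection the table may not have been filled in yet. A minimal sketch of a polling helper that retries until a condition holds is shown below; wait_until and rendered_source are hypothetical names, and the list of pages stands in for successive reads of driver.page_source.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(interval)


# Stand-in for a page whose table only appears on the third read.
pages = iter([
    "<div></div>",
    "<div></div>",
    "<div><table><tr><td>Week H/L</td></tr></table></div>",
])


def rendered_source():
    source = next(pages)  # with Selenium this would be driver.page_source
    return source if "<td" in source else None


html = wait_until(rendered_source, timeout=5, interval=0.01)
print("<td" in html)  # True
```

Selenium also ships its own WebDriverWait class (in selenium.webdriver.support.ui) with ready-made expected conditions, which serves the same purpose when working with a live driver.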

avanti-feeds-ltd/avanti/512573/

you will get the following output:

avanti-feeds-ltd/avanti/512573/ 52 Week High (adjusted) 999.00(13/11/2017)
avanti-feeds-ltd/avanti/512573/ 52 Week Low (adjusted) 410.26(05/06/2018)
avanti-feeds-ltd/avanti/512573/ 52 Week High (Unadjusted) 3,000.00(13/11/2017)
avanti-feeds-ltd/avanti/512573/ 52 Week Low (Unadjusted) 507.00(02/07/2018)
avanti-feeds-ltd/avanti/512573/ Month H/L 659.34/410.26
avanti-feeds-ltd/avanti/512573/ Week H/L 615.00/507.00
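Note that one of the values above, 3,000.00(13/11/2017), contains a comma, so joining the fields with "," by hand would split it into two columns in bse.csv. The sketch below shows how csv.writer quotes such a value; the io.StringIO buffer is only a stand-in for the real output file.

```python
import csv
import io

rows = [
    ["avanti-feeds-ltd/avanti/512573/", "52 Week High (Unadjusted)", "3,000.00(13/11/2017)"],
    ["avanti-feeds-ltd/avanti/512573/", "Week H/L", "615.00/507.00"],
]

buffer = io.StringIO()  # stands in for open('bse.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerows(rows)  # the value containing a comma is wrapped in double quotes
print(buffer.getvalue())

# Reading the text back yields three fields per row, with the comma intact.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed == rows)  # True
```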

If you do not have selenium and chromedriver yet, you need to install them first. I installed them on Mac OS like this:

sudo easy_install selenium
sudo easy_install chromedriver

You may find the following post helpful: