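Getting specific values from BeautifulSoup parsing

Date: 2017-02-20 18:01:48

Tags: python parsing beautifulsoup

I recently started learning more about Python and how to parse websites with BeautifulSoup.

The problem I'm facing now is that I seem to be stuck.

HTML code (taken from the soup):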

<div class="mod-3-piece-app__visual-container__chart">
    <div class="mod-ui-chart--dynamic" data-chart-config='{"chartData":{"periods":[{"year":2013,"period":null,"periodicity":"A","icon":null},{"year":2014,"period":null,"periodicity":"A","icon":null},{"year":2015,"period":null,"periodicity":"A","icon":null},{"year":2016,"period":null,"periodicity":"A","icon":null},{"year":2017,"period":null,"periodicity":"A","icon":null},{"year":2018,"period":null,"periodicity":"A","icon":null}],"forecastRange":{"from":3.5,"to":5.5},"actualValues":[5.6785,6.45,9.22,8.31,null,null],"consensusData":[{"y":5.6307,"toolTipData":{"low":5.5742,"high":5.7142,"analysts":34,"restatement":null}},{"y":6.3434,"toolTipData":{"low":6.25,"high":6.5714,"analysts":35,"restatement":null}},{"y":9.1265,"toolTipData":{"low":9.02,"high":9.28,"analysts":40,"restatement":null}},{"y":8.2734,"toolTipData":{"low":8.17,"high":8.335,"analysts":40,"restatement":null}},{"y":8.9304,"toolTipData":{"low":8.53,"high":9.63,"analysts":41,"restatement":null}},{"y":10.1252,"toolTipData":{"low":8.63,"high":11.61,"analysts":42,"restatement":null}}]}}'>
        <noscript>
            <div class="mod-ui-chart--static">
                <div class="mod-ui-chart--sprited" style="width:410px; height:135px; background:url('/data/Charts/EquityForecast?issueID=36276&amp;height=135&amp;width=410') 0px -270px no-repeat;">
                </div>
            </div>
        </noscript>
    </div>
</div>

My code:

from bs4 import BeautifulSoup
import urllib.request


data = []
List = ['AAPL']

# Iterates Through List
for i in List :   
    # The webpage which we wish to Parse
    soup = BeautifulSoup(urllib.request.urlopen('https://markets.ft.com/data/equities/tearsheet/forecasts?s=AAPL:NSQ').read(), 'lxml')

    # Gathering the data
    Values = soup.find_all("div", {"class":"mod-3-piece-app__visual-container__chart"})[4]
    print(Values)

    # Getting desired values from data

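What I want to get are the values after {"y": ..., i.e. the numbers 5.6307, 6.3434, 9.1265, 8.2734, 8.9304 and 10.1252, but I can't for the life of me figure out how. I tried Values.get_text as well as Values.text, but both just come back blank (probably because all the data sits in a list or something similar).

It would also be fine if I could get the data after "toolTipData".

Would anyone mind helping me?

If I have left anything out, please give me feedback so that I can ask better questions in the future.

Thanks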

1 Answer:

Answer 0 (score: 1)

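In short, you want to get some information that is stored inside a tag's attribute.

All I had to do was:

  1. Open the page source to see where your information is located
  2. Use find_all to look for the right class attribute, mod-ui-chart--dynamic
  3. For each element that find_all locates, use .get() to read the content of its data-chart-config attribute
  4. Search the attribute's content string for the term 'actualValues'
  5. If 'actualValues' is found, load the string as JSON and walk through its values.

Try the code below. I have commented it, so you should be able to follow what it does.

Code: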

    from bs4 import BeautifulSoup
    import urllib.request
    import json
    
    data = []
    List = ['AAPL']
    
    # Iterates Through List
    for i in List:   
        # The webpage which we wish to Parse
        soup = BeautifulSoup(urllib.request.urlopen('https://markets.ft.com/data/equities/tearsheet/forecasts?s=AAPL:NSQ').read(), 'lxml')
    
        # Gathering the data
        elemList = soup.find_all('div', {'class':'mod-ui-chart--dynamic'})
    
        # read the `data-chart-config` attribute of each `div` with `class=mod-ui-chart--dynamic`
        for elem in elemList:
    
            elemID = elem.get('class')
            elemName = elem.get('data-chart-config')
    
            #if there's no value in elemName, pass...
            if elemName is None:
                pass
    
            #if the term 'actualValues' exists in elemName 
            elif 'actualValues' in elemName:
                #print('Extracting actualValues from:\n')
                #print("Attribute id = %s" % elemID)
                #print()
                #print("Attribute name = %s" % elemName)
                #print()
    
                #reading `data-chart-config` attribute as a json
                data = json.loads(elemName)
    
                #print(json.dumps(data, indent=4, sort_keys=True))
                #print(data['chartData']['actualValues'])
    
                #fetching desired info
                val1 = data['chartData']['actualValues'][0]
                val2 = data['chartData']['actualValues'][1]
                val3 = data['chartData']['actualValues'][2]
                val4 = data['chartData']['actualValues'][3]
    
                #printing desired values
                print(val1, val2, val3, val4)
    
                print('-'*15)

Output:

    1.9 1.42 1.67 3.36
    ---------------
    5.6785 6.45 9.22 8.31
    ---------------
    50557000000 42358000000 46852000000 78351000000
    ---------------
    170910000000 182795000000 233715000000 215639000000
    ---------------
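
To see why Values.text came back blank in the question, here is a minimal offline sketch. It runs on a trimmed, hand-made copy of the question's HTML instead of the live page, uses the built-in 'html.parser' rather than 'lxml', and the names html, chart, visible_text, raw_config, actual and reported are my own. It also collects the values with a list comprehension instead of val1 to val4, which is just one possible way to handle the trailing nulls.

```python
import json
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (only actualValues kept).
html = """
<div class="mod-3-piece-app__visual-container__chart">
    <div class="mod-ui-chart--dynamic" data-chart-config='{"chartData":{"actualValues":[5.6785,6.45,9.22,8.31,null,null]}}'>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
chart = soup.find('div', {'class': 'mod-ui-chart--dynamic'})

# The div contains no visible text, only whitespace, so .text looks empty.
visible_text = chart.get_text().strip()

# The data is stored in the attribute value, which .get() returns as a string.
raw_config = chart.get('data-chart-config')

# That string is JSON; JSON null becomes Python None.
actual = json.loads(raw_config)['chartData']['actualValues']

# Keep only the years that already have a reported value.
reported = [value for value in actual if value is not None]

print(repr(visible_text))  # ''
print(reported)            # [5.6785, 6.45, 9.22, 8.31]
```

On the real page several charts share this class, so the same lookup would sit inside the loop over find_all as in the code above.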
    

p.s.1: If you like, you can uncomment the print() calls inside the elif branch to follow what the program does.

p.s.2: If you want, you can change 'actualValues' in val1 = data['chartData']['actualValues'][0] to another key of the JSON.
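
The question asked for the "y" values and the "toolTipData" rather than actualValues. Judging only from the data-chart-config sample in the question, those appear to sit under chartData, in the consensusData list. The following is a rough sketch on a shortened copy of that JSON (two of the six entries); the names raw_config, config, consensus, y_values and ranges are hypothetical, and I am assuming the live attribute has the same layout as the sample.

```python
import json

# Shortened copy of the data-chart-config value from the question
# (first two consensus entries only; all other keys omitted).
raw_config = '''{"chartData":{"consensusData":[
{"y":5.6307,"toolTipData":{"low":5.5742,"high":5.7142,"analysts":34,"restatement":null}},
{"y":6.3434,"toolTipData":{"low":6.25,"high":6.5714,"analysts":35,"restatement":null}}
]}}'''

config = json.loads(raw_config)
consensus = config['chartData']['consensusData']

# One "y" value per forecast period.
y_values = [entry['y'] for entry in consensus]

# The tooltip details for the same periods, here the low/high estimates.
ranges = [(entry['toolTipData']['low'], entry['toolTipData']['high'])
          for entry in consensus]

print(y_values)  # [5.6307, 6.3434]
print(ranges)    # [(5.5742, 5.7142), (6.25, 6.5714)]
```

Inside the loop from the answer, the equivalent would be to test for 'consensusData' in elemName and read data['chartData']['consensusData']. Other charts on the page may not carry that key, so the check is worth keeping.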