How to access the contents of a script tag using bs4

Time: 2019-03-06 12:25:57

Tags: python-3.x beautifulsoup

I'm new to Python. I'm trying to use Beautiful Soup to find the script tag that contains dataLayer on a page, then retrieve the value of postNo and print it.

  <head>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>

<!-- Data Layer - Begin -->
<script>
  dataLayer = [
    {
      'country': 'UnitedKingdom',
      'site': 'Blog',
      'postNo': '34',
      'pageType': 'Home',
      'pageType2': 'Blog',
      'pageType3': 'Top Tips'
    }
  ];
</script>
<!-- Data Layer - End -->
  </head>

Any help or pointers would be greatly appreciated. Thanks.

2 answers:

Answer 0 (score: 1)

import json

import bs4

html = '''
  <head>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>

<!-- Data Layer - Begin -->
<script>
  dataLayer = [
    {
      'country': 'UnitedKingdom',
      'site': 'Blog',
      'postNo': '34',
      'pageType': 'Home',
      'pageType2': 'Blog',
      'pageType3': 'Top Tips'
    }
  ];
</script>
<!-- Data Layer - End -->
  </head>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

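# Look through every <script> tag for the one that assigns dataLayer
scripts = soup.find_all('script')
for script in scripts:
    text = script.string or ''  # external scripts (src=...) have no inline text
    if 'dataLayer = ' in text:
        # Keep only what sits between the first '[' and the first ']'
        jsonStr = text.split('[')[1].strip()
        jsonStr = jsonStr.split(']')[0].strip()
        # JSON requires double-quoted strings
        jsonStr = jsonStr.replace("'", '"')

        jsonObj = json.loads(jsonStr)
        print(jsonObj['postNo'])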

Output:

34
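
The quote replacement above breaks if a value contains an apostrophe, and splitting on the first ']' breaks if the object holds a nested array. As a hedged alternative, the sketch below captures the whole array literal with a regular expression and parses it with ast.literal_eval. It assumes the literal is also valid Python (quoted keys, no true/false/null) and that no other ']' appears later in the same script; parse_data_layer is a hypothetical helper name, not part of bs4.

```python
import ast
import re

# Hypothetical helper: pull the array assigned to dataLayer out of raw script text.
# Assumes the JavaScript literal is also a valid Python literal.
DATA_LAYER_RE = re.compile(r"dataLayer\s*=\s*(\[.*\])", re.DOTALL)


def parse_data_layer(script_text):
    """Return the dataLayer list, or None if the script does not assign it."""
    match = DATA_LAYER_RE.search(script_text)
    if match is None:
        return None
    # literal_eval only evaluates literals, so no code from the page is executed
    return ast.literal_eval(match.group(1))


sample = """
  dataLayer = [
    {
      'site': "Blog's home",
      'postNo': '34'
    }
  ];
"""

data = parse_data_layer(sample)
print(data[0]['postNo'])
```

The text passed in would be the inline content of the matching script tag, for example script.string from the loop above.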

Answer 1 (score: 0)

Just extract the list from the HTML and parse it; it's straightforward. See the code below.

from bs4 import BeautifulSoup
import ast
html = '''
<head>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>

<!-- Data Layer - Begin -->
<script>
  dataLayer = [
    {
      'country': 'UnitedKingdom',
      'site': 'Blog',
      'postNo': '34',
      'pageType': 'Home',
      'pageType2': 'Blog',
      'pageType3': 'Top Tips'
    }
  ];
</script>
<!-- Data Layer - End -->
  </head>'''

soup = BeautifulSoup(html, 'html.parser')
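# dataLayer lives in the third <script> of this page; strip the assignment and the semicolon
content = soup.find_all('script')[2].string.replace(';', '').replace('dataLayer = ', '').strip()
# The remaining list literal is also valid Python, so literal_eval can parse it
data = ast.literal_eval(content)
print([x['postNo'] for x in data])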
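
Indexing with [2] depends on the script being the third one on the page, which may not hold if the page changes. A minimal sketch of locating the tag by its content instead, using the string argument of find with a compiled pattern; find_post_numbers is a hypothetical helper name, and the sketch again assumes the literal is valid Python.

```python
import ast
import re

from bs4 import BeautifulSoup


def find_post_numbers(html):
    """Return every postNo in the dataLayer script, or [] if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    # Match the <script> whose inline text assigns dataLayer, wherever it sits
    script = soup.find('script', string=re.compile(r'dataLayer\s*='))
    if script is None:
        return []
    # Take everything after the first '=', then drop whitespace and the final ';'
    literal = script.string.partition('=')[2].strip().rstrip(';')
    return [entry['postNo'] for entry in ast.literal_eval(literal)]


page = '''
<head>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
  <script>
    dataLayer = [
      {
        'site': 'Blog',
        'postNo': '34'
      }
    ];
  </script>
</head>'''

print(find_post_numbers(page))
```

Returning an empty list when no script matches keeps the caller from hitting an IndexError on pages without a dataLayer.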