Python3 Web Scraping and Parsing JSON

Time: 2019-01-22 15:54:51

Tags: python-3.x beautifulsoup

I'm new to Python programming. I'm trying to scrape information from the following:

<script type="text/javascript"> dataLayer = [{'user.IntExt': 'External','user.UserId': '', 'app.Page': 'stores.aspx','app.siteArea': 'YPO-HM','app.Version': 'TBD','acct.storeAccount': '200315','acct.storeState': 'AL','acct.storeChain': 'TBD','acct.chainName': 'TBD','acct.NCPDP': '0140044','acct.StoreSegment': 'TBD','acct.storeId': 2068,'acct.storeName': 'Athens Pharmacy','acct.storeZipCode': '35611','acct.storeRegion': 'SOUTH','acct.storeGAUAID': '',}];(function(w, d, s, l, i){w[l] = w[l] ||[]; w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f = d.getElementsByTagName(s)[0],j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j, f);})(window, document,'script','dataLayer','GTM-NW87TKH');
  </script>

import requests
import urllib.request
import urllib
from bs4 import BeautifulSoup
from csv import writer
import csv
import json
import re


url = 'https://stores.healthmart.com/athenspharmacy/stores.aspx'
response = requests.get(url)
soupdata = BeautifulSoup(response.text,'html.parser')

data = soupdata.find('script')
p = re.compile('var dataLayer = (.*?);')
groups = dict(re.findall(p, data.text))
#json_data = json.dumps(groups)
print (groups['acct.NCPDP'], groups['acct.storeId'])

Can anyone help with the code I need? I'd like to be able to get any of the values out of dataLayer. This is the source site: https://stores.healthmart.com/athenspharmacy/stores.aspx

1 answer:

Answer 0 (score: 0)

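When you call data = soupdata.find('script'), it returns only the first script tag it finds. You need to use find_all and then iterate over those elements to pick out the one you are looking for. After that, the string has to be manipulated into a format that json.loads() can parse.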
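To illustrate why the string needs that manipulation, here is a minimal sketch using a shortened stand-in for the dataLayer object literal from the question. The names raw and try_parse are made up for this example. It shows that json.loads() rejects both the single quotes and the trailing comma before the closing brace.

```python
import json

# Shortened stand-in for the object literal inside dataLayer = [ ... ];
raw = "{'acct.NCPDP': '0140044','acct.storeId': 2068,'acct.storeGAUAID': '',}"


def try_parse(text):
    """Hypothetical helper: return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Single-quoted strings are not valid JSON, so this fails.
as_is = try_parse(raw)

# Double quotes alone are not enough: the trailing comma before } is rejected too.
quoted = raw.replace("'", '"')
still_bad = try_parse(quoted)

# Dropping the trailing comma as well yields valid JSON.
cleaned = quoted.rsplit(',', 1)[0] + '}'
parsed = try_parse(cleaned)

print(as_is, still_bad, parsed)
```

Note that a blanket quote replacement only works as long as none of the values contains an apostrophe, which holds for the data shown in the question.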

import json

import requests
from bs4 import BeautifulSoup


url = 'https://stores.healthmart.com/athenspharmacy/stores.aspx'
response = requests.get(url)
soupdata = BeautifulSoup(response.text, 'html.parser')

scripts = soupdata.find_all('script')
jsonObj = None

for script in scripts:
    # Only the script tag that assigns dataLayer is of interest.
    if 'dataLayer =' in script.text:
        jsonStr = script.text
        # Keep only the object literal between "dataLayer = [" and "];".
        jsonStr = jsonStr.split('dataLayer = [')[1]
        jsonStr = jsonStr.split('];')[0]
        # JSON requires double-quoted strings.
        jsonStr = jsonStr.replace("'", '"')
        # Drop everything after the last comma (the trailing ",}") and close the object again.
        jsonStr = ','.join(jsonStr.split(',')[0:-1]) + '}'

        jsonObj = json.loads(jsonStr)

print (jsonObj['acct.NCPDP'], jsonObj['acct.storeId'])

Output:

print (jsonObj['acct.NCPDP'], jsonObj['acct.storeId'])
0140044 2068
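
As a possible alternative to the quote and comma replacement above, the dataLayer literal shown in the question also happens to be valid Python syntax (single-quoted strings, trailing comma), so ast.literal_eval can read it directly. This is only a sketch. SAMPLE_SCRIPT is a shortened copy of the script text from the question, parse_data_layer is a hypothetical helper name, and it assumes the live page keeps the same "dataLayer = [ ... ];" shape.

```python
import ast
import re

# Shortened copy of the script text from the question (assumption: the live
# page keeps the same "dataLayer = [ ... ];" shape).
SAMPLE_SCRIPT = (
    "dataLayer = [{'user.IntExt': 'External','acct.NCPDP': '0140044',"
    "'acct.storeId': 2068,'acct.storeName': 'Athens Pharmacy',"
    "'acct.storeGAUAID': '',}];(function(w, d, s, l, i){w[l] = w[l] ||[];})"
    "(window, document,'script','dataLayer','GTM-NW87TKH');"
)


def parse_data_layer(script_text):
    """Hypothetical helper: return the first dataLayer entry as a dict, or None."""
    # Non-greedy match of the list literal up to the first "];".
    match = re.search(r"dataLayer\s*=\s*(\[.*?\]);", script_text, re.DOTALL)
    if match is None:
        return None
    # Evaluate the literal safely as Python data (no code is executed).
    entries = ast.literal_eval(match.group(1))
    return entries[0] if entries else None


store = parse_data_layer(SAMPLE_SCRIPT)
print(store['acct.NCPDP'], store['acct.storeId'])
```

In the scraping loop this helper would be called on the text of each script tag until it returns a dict. It will stop working if the page ever puts JavaScript-only tokens such as true, false, null, or unquoted keys inside the literal, in which case the string cleanup followed by json.loads() is the safer route.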