Accessing a dictionary inside a script tag using Beautiful Soup

Time: 2019-01-10 10:41:37

Tags: python beautifulsoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay

url = 'https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html'  # page the data comes from
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find('script')
print(links)

This gives:

<script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "address": {
        "@type": "PostalAddress",
        "addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
        "postalCode": "411016 ",
        "streetAddress": " Pune/Maharashtra "
      },
      "name": "Banctec Tps India Pvt Ltd",
      "telephone": "(020) "
    }
    </script>

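I need to print the address dictionary nested inside this dictionary, and I need to access addressLocality, postalCode and streetAddress. I tried several approaches and all of them failed.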

4 answers:

Answer 0 (score: 2)

The tag's content is a JSON-formatted string; in Python, deserialize it with json.loads().

import json
links= soup.find('script')
print(links)

After that:

address = json.loads(links.text)['address']
print(address)
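
As a rough sketch of what json.loads() does here, the snippet below parses a hard-coded copy of the JSON from the question, so it runs without a network request, and reads the three address fields. The names raw_json, locality, postal_code and street are only illustrative. The values in the question carry stray whitespace, so .strip() is applied; that cleanup is an addition, not part of the answer above.

```python
import json

# Hard-coded copy of the JSON-LD text from the question
# (stands in for the text of the <script> tag).
raw_json = """
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "address": {
        "@type": "PostalAddress",
        "addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
        "postalCode": "411016 ",
        "streetAddress": " Pune/Maharashtra "
      },
      "name": "Banctec Tps India Pvt Ltd",
      "telephone": "(020) "
    }
"""

data = json.loads(raw_json)   # JSON object -> Python dict
address = data["address"]     # the nested dict

# Strip the padding that the source data contains.
locality = address["addressLocality"].strip()
postal_code = address["postalCode"].strip()
street = address["streetAddress"].strip()

print(locality)
print(postal_code)
print(street)
```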

Answer 1 (score: 1)

Use the json package:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
import json

url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find_all('script')
print(links)


for script in links:
    if '@context' in script.text:
        jsonStr = script.string
        jsonObj = json.loads(jsonStr)

print (jsonObj['address'])

Output:

print (jsonObj['address'])
{'@type': 'PostalAddress', 'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi', 'postalCode': '411016 ', 'streetAddress': ' Pune/Maharashtra '}
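
A possible variation, assuming the page marks its structured data with type="application/ld+json" as in the question: instead of scanning every script for '@context', the tag can be selected by its type attribute. The inline HTML below is a made-up stand-in for the real page, and html.parser is used only so the sketch does not depend on lxml.

```python
import json
from bs4 import BeautifulSoup

# Small inline page standing in for the real site (assumption: same tag layout).
html = """
<html><head>
<script>var unrelated = 1;</script>
<script type="application/ld+json">
{"@type": "Organization",
 "address": {"@type": "PostalAddress", "postalCode": "411016 "},
 "name": "Banctec Tps India Pvt Ltd"}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the type attribute rather than inspecting each script's text.
tag = soup.find("script", type="application/ld+json")

# Guard against a missing tag or an empty script body before parsing.
address = json.loads(tag.string)["address"] if tag and tag.string else None
print(address)
```

If a page carries several JSON-LD blocks, soup.find_all with the same arguments returns all of them, and each can be parsed in turn.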

Answer 2 (score: 1)

Use the string attribute to get the element's text, which can then be parsed as JSON.

import json

links_dict = json.loads(links.string)  # links is the <script> tag from the question
address = links_dict['address']

Answer 3 (score: 0)

Script tags often contain a lot of JavaScript besides the data. You can use a regular expression to isolate the dictionary:

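import json
import re

scripts = soup.find_all('script')  # soup is the BeautifulSoup object from the question
for script in scripts:
    if '@context' in script.text:
        # Extra step to isolate the dictionary.
        # re.DOTALL lets '.' match newlines, since the JSON spans several lines.
        jsonStr = re.search(r'\{.*\}', str(script), re.DOTALL).group()
        # Create dictionary
        dct = json.loads(jsonStr)

print(dct['address'])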
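
To illustrate the regex step on its own, here is a small sketch with a hypothetical script body in which a JSON object is wrapped in JavaScript. Because the object spans several lines, the pattern needs re.DOTALL so that '.' also matches newlines. The greedy match runs from the first '{' to the last '}', so it only works when the script holds a single object and no other braces; script_text and its contents are invented for the example.

```python
import json
import re

# Hypothetical script body: JavaScript wrapped around a JSON object.
script_text = """
var company =
{
  "name": "Banctec Tps India Pvt Ltd",
  "address": {"postalCode": "411016 "}
};
"""

# Greedy match from the first '{' to the last '}'; DOTALL lets '.' cross newlines.
match = re.search(r"\{.*\}", script_text, re.DOTALL)

# Fall back to an empty dict when no braces are found.
dct = json.loads(match.group()) if match else {}
print(dct.get("address"))
```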