Question

from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay

url = 'https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html'  # page I need data from
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find('script')
print(links)

This gives:

<script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Organization",
        "address": {
        "@type": "PostalAddress",
        "addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
        "postalCode": "411016 ",
        "streetAddress": " Pune/Maharashtra "
      },
      "name": "Banctec Tps India Pvt Ltd",
      "telephone": "(020) "
    }
    </script>

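I need to print the address dictionary nested inside this dictionary. Specifically, I need to access addressLocality, postalCode, and streetAddress. I have tried several approaches and all of them failed.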

Answer 1

The script content is a JSON-formatted string. In Python, deserialize it with json.loads():

import json
links= soup.find('script')
print(links)

After that:

address = json.loads(links.text)['address']
print(address)
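
Once the address dictionary is available, the three fields from the question can be read by key. The following is a minimal sketch, assuming the script content matches the output shown in the question; the JSON is pasted inline as a string (the name raw is only illustrative) so the example runs without a network request. The values on that page carry stray spaces, so strip() is applied here as an optional cleanup.

```python
import json

# JSON-LD payload copied from the question's output (assumed unchanged).
raw = """
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
    "postalCode": "411016 ",
    "streetAddress": " Pune/Maharashtra "
  },
  "name": "Banctec Tps India Pvt Ltd",
  "telephone": "(020) "
}
"""

data = json.loads(raw)        # str -> dict
address = data["address"]     # nested dict

# Read each field by key; strip() removes the stray surrounding spaces.
locality = address["addressLocality"].strip()
postal_code = address["postalCode"].strip()
street = address["streetAddress"].strip()

print(locality)
print(postal_code)
print(street)
```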

Answer 2

Use the json package:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time  # to add delay
import json

url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find_all('script')
print(links)


for script in links:
    # Only the JSON-LD script contains the '@context' key.
    if '@context' in script.text:
        jsonStr = script.string
        jsonObj = json.loads(jsonStr)
        print(jsonObj['address'])

Output:

print (jsonObj['address'])
{'@type': 'PostalAddress', 'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi', 'postalCode': '411016 ', 'streetAddress': ' Pune/Maharashtra '}

Answer 3

Use the string attribute to get the element's text, which can then be parsed as JSON.

import json

# links is the <script> tag returned by soup.find('script') in the question.
links_dict = json.loads(links.string)
address = links_dict['address']
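
Note that soup.find('script') returns whichever script tag happens to come first on the page. A somewhat more robust sketch, assuming the JSON-LD tag keeps the type="application/ld+json" attribute shown in the question, selects the tag by that attribute. The HTML below is a shortened stand-in for the real page, and html.parser is used only so the example does not depend on lxml.

```python
import json
from bs4 import BeautifulSoup

# Shortened stand-in for the real page: an ordinary script tag comes first.
html = """<html><head>
<script>var tracking = {};</script>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "address": {
    "@type": "PostalAddress",
    "postalCode": "411016 "
  }
}
</script>
</head></html>"""

soup = BeautifulSoup(html, "html.parser")

# Filter on the type attribute instead of taking the first <script>.
tag = soup.find("script", type="application/ld+json")

# .string is None when the tag is missing or empty, so guard before parsing.
address = json.loads(tag.string)["address"] if tag and tag.string else {}
print(address)
```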

Answer 4

Script tags often contain a lot of JavaScript. You can use a regular expression to isolate the dictionary:

import json
import re

scripts = soup.find_all('script')
for script in scripts:
    if '@context' in script.text:
        # Extra step to isolate the dictionary.
        # re.DOTALL lets '.' match newlines, since the JSON spans several lines.
        jsonStr = re.search(r'\{.*\}', str(script), re.DOTALL).group()
        # Create dictionary
        dct = json.loads(jsonStr)
        print(dct['address'])
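
One detail of the regular-expression approach is worth checking: by default '.' does not match a newline, so a pattern such as \{.*\} finds nothing when the JSON is spread over several lines, as it is in the question. The sketch below illustrates this on a shortened copy of the script tag (the variable names are illustrative) and shows the re.DOTALL flag as one possible fix.

```python
import json
import re

# Shortened copy of the multi-line script tag from the question.
script_html = """<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "address": {
    "@type": "PostalAddress",
    "postalCode": "411016 "
  },
  "name": "Banctec Tps India Pvt Ltd"
}
</script>"""

pattern = r"\{.*\}"

# Without a flag, '.' cannot cross line breaks, so there is no match.
no_flag_match = re.search(pattern, script_html)
print(no_flag_match)

# With re.DOTALL the greedy match runs from the first '{' to the last '}'.
match = re.search(pattern, script_html, re.DOTALL)
data = json.loads(match.group())
print(data["address"])
```

Because the match is greedy, this only works when the tag holds a single JSON object and nothing else containing braces; for a tag that is pure JSON, parsing script.string directly is simpler.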

Accessing a dictionary inside a script tag using Beautiful Soup
