from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
url = 'https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html'  # page I need data from
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find('script')
print(links)
This gives ->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Organization",
"address": {
"@type": "PostalAddress",
"addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
"postalCode": "411016 ",
"streetAddress": " Pune/Maharashtra "
},
"name": "Banctec Tps India Pvt Ltd",
"telephone": "(020) "
}
</script>
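I need to print the address dictionary nested inside this dictionary, and I need to access addressLocality, postalCode, and streetAddress. I tried different approaches and they all failed.
Answer 0: (score: 2)
The data is a JSON-formatted string; in Python, deserialize it with json.loads():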
import json
links= soup.find('script')
print(links)
After this,
address = json.loads(links.text)['address']
print(address)
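To reach the individual fields the question asks for, the parsed address dictionary can be indexed by key. A minimal sketch, assuming the script text has already been extracted: the `raw` string is a hand-copied stand-in for `links.text`, and the `.strip()` calls are an addition of mine to tidy the padded values, not part of the original answer.

```python
import json

# Stand-in for links.text: the JSON-LD payload shown in the question.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
    "postalCode": "411016 ",
    "streetAddress": " Pune/Maharashtra "
  },
  "name": "Banctec Tps India Pvt Ltd",
  "telephone": "(020) "
}
"""

address = json.loads(raw)["address"]

# Index the nested dictionary by key; strip() removes the stray padding.
locality = address["addressLocality"].strip()
postal_code = address["postalCode"].strip()
street = address["streetAddress"].strip()

print(locality)
print(postal_code)
print(street)
```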
Answer 1: (score: 1)
Use the json package:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
import json
url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find_all('script')
print(links)
for script in links:
    if '@context' in script.text:
        jsonStr = script.string
        jsonObj = json.loads(jsonStr)
        print(jsonObj['address'])
Output:
print (jsonObj['address'])
{'@type': 'PostalAddress', 'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi', 'postalCode': '411016 ', 'streetAddress': ' Pune/Maharashtra '}
Answer 2: (score: 1)
Use the string attribute to get the element's text, which can then be parsed as JSON.
links_dict = json.loads(links.string)
address = links_dict['address']
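Two things can go wrong here: `find('script')` returns only the first script tag on the page, and `.string` is None when a tag does not have exactly one text child. A hedged sketch that selects the JSON-LD tag by its type attribute and guards both cases follows. The inline HTML and the helper name `extract_address` are illustrative assumptions, and `html.parser` is used only so the sketch does not depend on lxml.

```python
import json
from bs4 import BeautifulSoup

# Illustrative page: an ordinary script precedes the JSON-LD block.
sample_html = """
<html><head>
<script>var tracking = true;</script>
<script type="application/ld+json">
{"@type": "Organization",
 "address": {"@type": "PostalAddress", "postalCode": "411016 "}}
</script>
</head><body></body></html>
"""

def extract_address(html):
    """Return the 'address' dict from the first JSON-LD script, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Filter on the type attribute so unrelated scripts are skipped.
    tag = soup.find("script", type="application/ld+json")
    if tag is None or tag.string is None:
        return None
    return json.loads(tag.string).get("address")

address = extract_address(sample_html)
print(address)
```

On a page with no JSON-LD block, the helper returns None instead of raising.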
Answer 3: (score: 0)
Script tags often contain a lot of JavaScript. You can use a regular expression to isolate the dictionary:
import json
import re

scripts = soup.find_all('script')
for script in scripts:
    if '@context' in script.text:
        # Extra step to isolate the dictionary; re.DOTALL lets '.' span newlines.
        jsonStr = re.search(r'\{.*\}', str(script), re.DOTALL).group()
        # Create dictionary
        dct = json.loads(jsonStr)
        print(dct['address'])
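One caveat with the regular-expression approach: by default `.` does not match a newline, so a JSON block that spans several lines is only captured when the re.DOTALL flag is passed. A small sketch of the difference, where `script_text` is a made-up stand-in for `str(script)`:

```python
import json
import re

# Stand-in for str(script): a script tag whose JSON spans several lines.
script_text = """<script type="application/ld+json">
{
  "@type": "Organization",
  "address": {
    "postalCode": "411016 "
  }
}
</script>"""

pattern = r"\{.*\}"

# Without re.DOTALL, '.' stops at newlines; no single line here holds
# both an opening and a closing brace, so nothing matches.
no_flag = re.search(pattern, script_text)

# With re.DOTALL, the greedy match runs from the first '{' to the last '}'.
with_flag = re.search(pattern, script_text, re.DOTALL)

dct = json.loads(with_flag.group())
print(no_flag)
print(dct["address"])
```

Because the match is greedy, this only works when the tag holds a single JSON object; a script that mixes JSON with other braces would need a stricter approach.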