How do I use BeautifulSoup to get information from this HTML?

Time: 2018-01-19 05:56:26

Tags: python web-scraping beautifulsoup

I want to get all of a company's social links from this page. When I run

summary_div.find("div", {'class': "cp-summary__social-links"})

I get this:

<div class="cp-summary__social-links">
<div data-integration-name="react-component" data-payload='{"props":
{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},
{"url":"http://www.linkedin.com/company/snapdeal?utm_source=craft.co","icon":"linkedin","label":"LinkedIn"},
{"url":"https://instagram.com/snapdeal/?utm_source=craft.co","icon":"instagram","label":"Instagram"},
{"url":"https://www.facebook.com/Snapdeal?utm_source=craft.co","icon":"facebook","label":"Facebook"},
{"url":"https://www.crunchbase.com/organization/snapdeal?utm_source=craft.co","icon":"cb","label":"CrunchBase"},
{"url":"https://www.youtube.com/user/snapdeal?utm_source=craft.co","icon":"youtube","label":"YouTube"},
{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],
"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>

I also tried taking the children of the cp-summary__social-links div and then calling find_all on them for a tags to collect the links. That didn't work either.
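A likely reason the a-tag search comes up empty: the output above contains no anchor elements at all. The link data sits as a JSON string in the data-payload attribute of the inner div, presumably to be rendered in the browser by a script. The sketch below illustrates this with a trimmed copy of that markup (only two links kept; the `html` string is a shortened sample, not the live page).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (two links kept for brevity).
html = """
<div class="cp-summary__social-links">
<div data-integration-name="react-component" data-payload='{"props":
{"links":[{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},
{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],
"companyName":"Snapdeal"},"name":"CompanyLinks"}' data-rwr-element="true"></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "cp-summary__social-links"})

# The static HTML has no <a> elements under the container, so this is empty.
anchors = container.find_all("a")

# The link data is an attribute value (a plain string) on the inner div.
inner = container.find("div", {"data-integration-name": "react-component"})
payload = inner["data-payload"]

print(len(anchors))
print(type(payload).__name__)
```

Since the payload is just a string, it has to be parsed as JSON before the URLs can be read out of it.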

Any ideas on how to do this?

Update: As Sraw suggested, I managed to get all the URLs by doing this:

import json

urls = []
social_link = summary_div.find("div", {'class': "cp-summary__social-links"}).find("div", {"data-integration-name": "react-component"})
# data-payload holds the link data as a JSON string
json_text = json.loads(social_link["data-payload"])
for link in json_text['props']['links']:
    urls.append(link['url'])
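For reuse, the same idea can be wrapped in a small function that tolerates pages where the block is missing. This is only a sketch: `extract_social_links` and `sample_html` are hypothetical names introduced here, and the payload layout (`props` -> `links` -> `label`/`url`) is assumed to match the output shown in the question.

```python
import json
from bs4 import BeautifulSoup


def extract_social_links(soup):
    """Return {label: url} from the react-component payload, or {} if absent."""
    container = soup.find("div", {"class": "cp-summary__social-links"})
    if container is None:
        return {}
    component = container.find("div", {"data-integration-name": "react-component"})
    if component is None or not component.has_attr("data-payload"):
        return {}
    # The attribute value is a JSON string; parse it into Python objects.
    payload = json.loads(component["data-payload"])
    links = payload.get("props", {}).get("links", [])
    return {link["label"]: link["url"] for link in links}


# Hypothetical sample: a shortened version of the markup from the question.
sample_html = (
    '<div class="cp-summary__social-links">'
    '<div data-integration-name="react-component" data-payload=\''
    '{"props":{"links":['
    '{"url":"http://www.snapdeal.com?utm_source=craft.co","icon":"web","label":"Website"},'
    '{"url":"https://twitter.com/snapdeal?utm_source=craft.co","icon":"twitter","label":"Twitter"}],'
    '"companyName":"Snapdeal"},"name":"CompanyLinks"}\'></div></div>'
)

links = extract_social_links(BeautifulSoup(sample_html, "html.parser"))
for label, url in links.items():
    print(label, url)
```

Returning a dict keyed by label makes it easy to pick out one network, e.g. `links.get("Twitter")`, while `list(links.values())` gives the same flat list of URLs as the loop above.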

Thanks in advance.

0 answers:

No answers