Parsing a web page that contains both HTML and JSON data

Time: 2014-10-14 02:46:33

Tags: json beautifulsoup python-requests

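If we open any LinkedIn profile using a URL of the following format,

https://www.linkedin.com/profile/view?id=******

its page source contains both HTML and JSON-like data.

For example, here is a snapshot of part of the page: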

    <div id="wrapper" class="   ">
    <noscript>
    <div class="alert attention">
    <p>To use the new LinkedIn Profile, you need to use a JavaScript-enabled browser.</p>
    </div>
    </noscript>
    <div id="profile" data-target-section="" class="
        edit-optimize-b

      ">
    <script src="https://static.licdn.com/scds/concat/common/js?h=8qji796o7luuc5qebeklnfxg-6k4804204n2d64ar53frzjflx-am5l5iawgd1xydon33b2zrxrm-8ksuhtsv75iwx507s59ozxg8z-4om608zf4z1l16u6gia1r1nl4-etd36al748tj3xjw93tjsq28r-18xghyk5dj4ylzwoeowf2v4nl-81p41hg3ea2ppz1r1qqnkihr9-600kewn3yidvo0tdhxbinv5w4-55xgeab0q21kozzkkwgg4ll3n-daug78gfu797lpts0ra7vaxs2-i65w1vlxwysral3p97aa6qz1-9z35pc580j5fhgi9i1tyjy222-5d5c9yntbr1nz9vis4prnxrjn-8fr13b2hdtmd3ku2bqtvxbko5-1jkaqitq9g3cn3dew10xyowbo-5enm4gpqbu8k8pt3qozz2kvaa-dajp41q0p1qlrytp0gi7o3g18-3v23bn6h3o20jmxlwt3umhpgb-74zch2ojvsl7e17jxmdj0sh4t-a9vgeg22sqt8yvzjgsk0equ22-f53t6qzi2u49p4vajj0zxjrte-7kj5pd97nkqk125irflei8wwq-dak9fbazcz0dotcqsq8gmyy2e-aubsc0o5q251ep2ufgdq14054-d9svgbrsldq3s97yxa8yds3ot-8qpeop8m0699wvo7vl94i8h3n-2as1ndilndxua9jy3u8xn5uzi-27liyun53w7ijiaonselucnnk-60ogya2ejpokb2qem28m4vctf-1oui4pqn18obmsjw0o86jzp3e-b8h30nmba8sv0pvy2inw7ck21-9q298vhth0t671o1ielnqwehr-8an1nf43qsn9c9lwkspkuuw70-76t2yu8hu5c3p3aihln6nwdhr-c0iqmzoavt0ocse54nrqo6idg&fc=2" type="text/javascript"></script><code id="profile_v2_guided_edit_promo-content" style="display:none;">

<!--{"content":{"something_went_wrong":"Sorry, something went wrong. Please try again.","lix_show_premium_toggle_settings":"control","EndorseDialogJS":"https://static.licdn.com/scds/concat/common/js?h=3gtm46fgengh7teck5sse5647\u002ddvpi6u7xt7458bie98t378c7j\u002da5shq2aqp1lrabprnnh0rhkjh&fc=2","i18n_our_server_has_encountered_an_error":"Sorry, our server has encountered an error. Please try again later.","profile_v2_megaphone_articles":{"formattedInfluencerName":"****","basic_info":{"industryID":97,"showTopCardDetail":true,"visible":true,"isPortfolio":false,"completenessLevel":9,"profilePagekey":"nprofile_self"

All the data I want is in the line that begins with:

 <!--{"content":{"som

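If I use BeautifulSoup, it does not parse this data. If I pass the text to json.loads(), it does not load either, because it is not strictly JSON: the payload is wrapped in an HTML comment.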
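To illustrate why the raw text is rejected, here is a minimal sketch that uses only the standard library. The sample string is a shortened, made-up stand-in for the real payload, and `strip_html_comment` is a hypothetical helper name, not part of any library.

```python
import json

# Hypothetical stand-in for the text inside the <code> element; the real
# page carries a much larger payload.
raw = '<!--{"content":{"something_went_wrong":"Sorry, something went wrong. Please try again."}}-->'


def strip_html_comment(text):
    """Remove a leading '<!--' and a trailing '-->' if both are present."""
    text = text.strip()
    if text.startswith("<!--") and text.endswith("-->"):
        return text[len("<!--"):-len("-->")]
    return text


# The comment delimiters make the raw text invalid JSON.
try:
    json.loads(raw)
    raw_is_json = True
except json.JSONDecodeError:
    raw_is_json = False

# Once the delimiters are removed, the remainder parses normally.
data = json.loads(strip_html_comment(raw))
print(raw_is_json)
print(data["content"]["something_went_wrong"])
```

This only handles text that is already isolated from the surrounding HTML; locating the right element in the page is a separate step.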

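I am not sure how to parse this.

Can anyone suggest an approach? Or is what I am doing completely wrong?

Please visit the link through a requests session before answering; otherwise you may not fully understand my point.

Here is the code snippet for your reference: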

    import requests
    from bs4 import BeautifulSoup as bs

    # Placeholders: replace with real LinkedIn credentials.
    username = "username"
    password = "passwd"

    # Use a session so cookies persist across the login and profile requests.
    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"})

    # Fetch the home page and read the hidden fields of the login form.
    r = s.get("https://www.linkedin.com/")
    soup = bs(r.content, "html.parser")

    loginCsrfParam = soup.find(id="loginCsrfParam-login")["value"]
    csrfToken = soup.find("input", {"name": "csrfToken"})["value"]
    sourceAlias = soup.find("input", {"name": "sourceAlias"})["value"]

    data = {
        "isJsEnabled": "true",
        "session_key": username,
        "session_password": password,
        "signin": "Sign In",
        "loginCsrfParam": loginCsrfParam,
        "csrfToken": csrfToken,
        "sourceAlias": sourceAlias,
    }

    # Log in, then request the profile page.
    r = s.post("https://www.linkedin.com/uas/login-submit", data=data)
    r = s.get("https://www.linkedin.com/profile/view?id=******")
    stuff = r.content

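1 answer:

Answer 0 (score: 2)

I am glad I finally found the answer. Surprisingly, it was in the bs4 documentation itself.

The way to grab an HTML comment is to use .string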

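    import json
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(r.content, "html.parser")
    # In the question's snippet the JSON comment sits inside the <code> element.
    comment = soup.find("code", id="profile_v2_guided_edit_promo-content").string
    comment = json.loads(comment)  # the comment text is a JSON object in a string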

Now comment holds the parsed JSON data :-)
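One caveat worth sketching: `.string` only returns a value when the tag has exactly one child, so it yields `None` if whitespace surrounds the comment. A more defensive variant, assuming bs4 is installed, searches for the `Comment` node explicitly. The sample HTML below is a shortened, made-up stand-in for the page source, and `extract_comment_json` is a hypothetical helper name; only the element id is taken from the question's snippet.

```python
import json

from bs4 import BeautifulSoup, Comment

# Hypothetical, shortened stand-in for the page source in the question.
html = """
<div id="profile">
<code id="profile_v2_guided_edit_promo-content" style="display:none;">

<!--{"content":{"basic_info":{"industryID":97,"visible":true}}}-->
</code>
</div>
"""


def extract_comment_json(markup, element_id):
    """Return the JSON found in the first HTML comment inside the element."""
    soup = BeautifulSoup(markup, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        raise ValueError("no element with id %r" % element_id)
    # Look for a Comment node among the element's text nodes.
    comment = element.find(string=lambda node: isinstance(node, Comment))
    if comment is None:
        raise ValueError("no HTML comment inside %r" % element_id)
    return json.loads(comment)


data = extract_comment_json(html, "profile_v2_guided_edit_promo-content")
print(data["content"]["basic_info"]["industryID"])
```

In this sample the `<code>` element contains whitespace around the comment, so `.string` on it would be `None`, while the explicit `Comment` search still finds the payload.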