Extracting a URL with BeautifulSoup

Time: 2018-02-21 20:31:04

Tags: python web-scraping

Using this code:

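import urllib.request
from bs4 import BeautifulSoup

url = "https://github.com/search?o=desc&p=1&q=stars%3A%3E1&s=stars&type=Repositories"

# Download the search results page and decode it to text
with urllib.request.urlopen(url) as response:
    html = response.read()
    html = html.decode('utf-8')

# Keep a local copy of the page for inspection
with open('page_content.html', 'w', encoding='utf-8') as new_file:
    new_file.write(html)

soup = BeautifulSoup(html, 'lxml')

# Every repository link in the results carries the "v-align-middle" class
g_data = soup.find_all("a", {"class": "v-align-middle"})

print(g_data[0])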

The output is:

<a class="v-align-middle" data-hydro-click='{"event_type":"search_result.click","payload":{"page_number":1,"query":"stars:&gt;1","result_position":1,"click_id":28457823,"result":{"id":28457823,"global_relay_id":"MDEwOlJlcG9zaXRvcnkyODQ1NzgyMw==","model_name":"Repository","url":"https://github.com/freeCodeCamp/freeCodeCamp"},"originating_request_id":"ECC6:1DF24:CE9C0F:1667572:5A8DDD6F"}}' data-hydro-hmac="42c4e038b86cefc302d5637e870e6d746ee7fa95eadf2b26930cb893c6a3bc53" href="/freeCodeCamp/freeCodeCamp">freeCodeCamp/freeCodeCamp</a>

How can I extract the following URL from this output: https://github.com/freeCodeCamp/freeCodeCamp

Thanks!

2 answers:

Answer 0: (score: 1)

Get the value of the data-hydro-click attribute, parse it with json.loads(), and use the result as a regular Python dict:

import json
# your other code, up to setting the g_data

data_hydro = g_data[0]['data-hydro-click']
data_hydro = json.loads(data_hydro)
print(data_hydro['payload']['result']['url'])
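
As a slightly more defensive variant of the snippet above, the sketch below wraps the lookup in a helper that returns None when the attribute is missing or does not hold the expected JSON. The helper name extract_result_url, the shortened sample markup, and the use of the built-in 'html.parser' (instead of 'lxml', to avoid the extra dependency) are assumptions made for illustration and are not part of the original answer.

```python
import json
from bs4 import BeautifulSoup

# Shortened sample markup modelled on the output in the question (illustrative only).
SAMPLE_HTML = (
    '<a class="v-align-middle" '
    'data-hydro-click=\'{"payload":{"result":'
    '{"url":"https://github.com/freeCodeCamp/freeCodeCamp"}}}\' '
    'href="/freeCodeCamp/freeCodeCamp">freeCodeCamp/freeCodeCamp</a>'
)


def extract_result_url(tag):
    """Return payload.result.url from a tag's data-hydro-click JSON, or None."""
    raw = tag.get("data-hydro-click")
    if raw is None:
        # The attribute is absent on this tag.
        return None
    try:
        data = json.loads(raw)
        return data["payload"]["result"]["url"]
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, or JSON that lacks the expected nested keys.
        return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.find_all("a", {"class": "v-align-middle"})
urls = [extract_result_url(a) for a in links]
print(urls)
```

Applied to the g_data list from the question, the same list comprehension would collect one URL (or None) per search result rather than only the first.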

Answer 1: (score: 1)

The URL is inside a JSON string stored in an attribute, which is why it is hard to find:

import json
from bs4 import BeautifulSoup

html = """
<h3>
<a href="/freeCodeCamp/freeCodeCamp" class="v-align-middle"data-hydro-click="{&quot;event_type&quot;:&quot;search_result.click&quot;,&quot;payload&quot;:{&quot;page_number&quot;:1,&quot;query&quot;:&quot;stars:>1&quot;,&quot;result_position&quot;:1,&quot;click_id&quot;:28457823,&quot;result&quot;:{&quot;id&quot;:28457823,&quot;global_relay_id&quot;:&quot;MDEwOlJlcG9zaXRvcnkyODQ1NzgyMw==&quot;,&quot;model_name&quot;:&quot;Repository&quot;,&quot;url&quot;:&quot;https://github.com/freeCodeCamp/freeCodeCamp&quot;},&quot;originating_request_id&quot;:&quot;EB94:4DE3:1D61C50:2AEAFBA:5A8D8E31&quot;}}" data-hydro-hmac="2b170325f8ff481731dd5f65d85e7e94a356f75bdafce1f9c5cc60d112cbc2f8">freeCodeCamp/freeCodeCamp</a>
</h3>
"""
soup = BeautifulSoup(html, 'lxml')
# The parser decodes the &quot; entities, so the attribute value is plain JSON
parsed_json = json.loads(soup.a.get('data-hydro-click'))
print(parsed_json['payload']['result']['url'])
# prints https://github.com/freeCodeCamp/freeCodeCamp
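
The sample markup above stores the JSON with its double quotes escaped as &quot;. An HTML parser decodes such entities when it reads attribute values, which is why the value can be passed straight to json.loads(). The sketch below reproduces that decoding step with only the standard library, using html.unescape on a shortened, illustrative copy of the attribute value; the variable names are assumptions and do not come from the answer.

```python
import html
import json

# Shortened copy of the escaped attribute value from the sample markup (illustrative).
raw_attribute = (
    "{&quot;payload&quot;:{&quot;result&quot;:"
    "{&quot;url&quot;:&quot;https://github.com/freeCodeCamp/freeCodeCamp&quot;}}}"
)

# Turn &quot; back into literal double quotes, as an HTML parser would.
decoded = html.unescape(raw_attribute)

# The decoded text is ordinary JSON and can be walked like a dict.
parsed = json.loads(decoded)
result_url = parsed["payload"]["result"]["url"]
print(result_url)
```

If the raw page text were searched with a regular expression instead of being parsed, this unescaping step would have to be done by hand, which is one reason to read the attribute through BeautifulSoup.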