Python - Beautiful Soup - getting a link with find_all

Time: 2018-01-22 20:06:07

Tags: python python-3.x beautifulsoup

I am trying to scrape a website with BeautifulSoup. My problem is that I only want to get one link from the HTML source, but I end up with a horrible list.

<div class="table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0">
  <a href="/Member1">
  <img alt="@Member1" class="avatar float-left" height="48" src="https://avatars0.githubusercontent.com/u/xxxxxxx" width="48" />
</a>

I only want /Member1 or @Member1. My code looks like this:

Membres={}
response = requests.get('https://github.com/orgs/xxxxxxxx/people?page=1')
soup = BeautifulSoup(response.content, "html.parser")
for e in soup.find_all("div",{"class":"table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0"}):
    for d in e.find_all("a"):
        for f in d.find_all("img alt="):
            Membres[f]={}

So I tried to cut down the string in f, and also to go straight to the link, for example:

for d in e.find_all("a", href=True):

but my keys still contain a lot of extra information. Does anyone know a way to get just the Member1 name?

Thanks

2 Answers:

Answer 0: (score: 1)

You can try a simple list comprehension to extract the href from the <a> tags:

for e in soup.find_all("div",{"class":"table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0"}):
    my_list = [a['href'] for a in e.find_all('a')] 

This gives:

>>> my_list
['/Member1']

To put them into a dictionary, you can use a similar syntax:

for e in soup.find_all("div",{"class":"table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0"}):
    my_dict = {a['href']:'' for a in e.find_all('a')}
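
Note that the loops above rebind my_list and my_dict on every matching div, so with several members on the page only the last cell survives. A minimal sketch that accumulates across all cells and also reads the alt text of the avatar image might look like the following. The sample HTML (including the made-up second member), the extract_members helper name, and the shortened member-avatar-cell selector are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the second member is made up.
SAMPLE_HTML = """
<div class="table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0">
  <a href="/Member1">
    <img alt="@Member1" class="avatar float-left" height="48" src="https://avatars0.githubusercontent.com/u/xxxxxxx" width="48" />
  </a>
</div>
<div class="table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0">
  <a href="/Member2">
    <img alt="@Member2" class="avatar float-left" height="48" src="https://avatars0.githubusercontent.com/u/yyyyyyy" width="48" />
  </a>
</div>
"""


def extract_members(html):
    """Map each member's href to the alt text of its avatar (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    members = {}
    # Selecting on a single class is less brittle than matching the full class string.
    for link in soup.select("div.member-avatar-cell a[href]"):
        img = link.find("img", alt=True)
        # Fall back to an empty string when the link has no image with an alt attribute.
        members[link["href"]] = img["alt"] if img else ""
    return members


members = extract_members(SAMPLE_HTML)
print(members)
```

Because the dictionary is created once, outside the loop, every cell contributes an entry. Calling lstrip on the key or value (for example with "/" or "@") would leave just the bare member name if that is all you need.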

Answer 1: (score: 1)

You can use a regular expression:

import re
s = """
<div class="table-list-cell py-3 pl-3 v-align-middle member-avatar-cell css-truncate pr-0">
    <a href="/Member1">
    <img alt="@Member1" class="avatar float-left" height="48" src="https://avatars0.githubusercontent.com/u/xxxxxxx" width="48" />
  </a>
 """
user_data = dict(re.findall('<img alt="@(.*?)" class="avatar float-left" height="48" src="(.*?)" width="48" />', s))

Output:

{'Member1': 'https://avatars0.githubusercontent.com/u/xxxxxxx'}
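
The pattern above only matches when the class, height and width attributes have exactly those values. As a hedged variation, the sketch below loosens the pattern so that any attributes may sit between alt and src. It still assumes alt comes before src inside the tag, and the sample string with a different image size is made up for illustration.

```python
import re

# Sample tag with a different size than in the question (assumed for illustration).
s = """
<a href="/Member1">
  <img alt="@Member1" class="avatar" height="96" src="https://avatars0.githubusercontent.com/u/xxxxxxx" width="96" />
</a>
"""

# Capture the name after "@" and the src URL; [^>]*? skips any attributes in between.
pattern = re.compile(r'<img\s+alt="@([^"]*)"[^>]*?\ssrc="([^"]*)"')

# findall returns (name, url) tuples, which dict() turns into a name -> url mapping.
user_data = dict(pattern.findall(s))
print(user_data)
```

Regular expressions remain fragile against real-world HTML, so an HTML parser is usually the safer choice when the markup is likely to change.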