Getting the title and href under nested divs with BeautifulSoup

Date: 2020-06-24 05:58:37

Tags: python-3.x beautifulsoup

I am trying to extract the title and href from <a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587">, along with the text 'Daphne, Alabama' inside <a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a>, from the page source returned by requests.get.

<div class="_3u1 _gli" data-bt="{&quot;id&quot;:610823379,&quot;rank&quot;:5,&quot;abtest_version&quot;:null,&quot;abtest_params&quot;:{&quot;abtest_version&quot;:null,&quot;origin&quot;:&quot;A&quot;,&quot;ranker&quot;:null},&quot;section&quot;:&quot;main_column&quot;,&quot;owner_id&quot;:null,&quot;sub_id&quot;:null,&quot;browse_location&quot;:null,&quot;query_data&quot;:[],&quot;is_headline&quot;:false}" data-ft="{&quot;tn&quot;:&quot;-\\&quot;}">
<div>
<div class="clearfix _ikh">
<div class="_4bl7 _3-90">
<a title="Jon Turner" class="_2ial" aria-label="Jon Turner" aria-hidden="true" tabindex="-1" role="presentation" href="https://www.facebook.com/jon.turner.7587">
<img class="_1glk _6phc img" src="https://scontent-sjc3-1.xx.fbcdn.net/v/t1.0-1/cp0/p74x74/104258995_10157724933523380_6784568501858187427_o.jpg?_nc_cat=111&amp;_nc_sid=dbb9e7&amp;_nc_oc=AQnwXFpW7dNBp-Tnx11O3pHh4-GhD8BxtkQ8tXJFYRmA1UdUET0O4-o8L_f5GOHfEjj5v9hEFvmf5nrX8M7gSibd&amp;_nc_ht=scontent-sjc3-1.xx&amp;oh=e01068610b697b26a26e78e7d6bfe728&amp;oe=5F1938B5" width="72" height="72" alt="Jon Turner">
</a>
</div>

<div class="_4bl9">
<div data-testid="browse-result-content" class="_glj">
<div class="_5aj7">
<div class="_4bl9">
<div class="_gll">
<div class="_ajw">
<div style="-webkit-line-clamp: 2;" class="_52eh _5bcu">
<div>

<a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587">
<span>Jon Turner</span>
</a>
</div>
</div>
</div>
</div>
</div>

<div class="_4bl7">
<div class="_glk">
<a role="button" class="_42ft _4jy0 _4jy3 _517h _51sy" href="https://www.facebook.com/jon.turner.7587/photos" rel="dialog" ajaxify="/ajax/timeline/sign_up_dialog/?next=https%3A%2F%2Fwww.facebook.com%2Fjon.turner.7587%2Fphotos&amp;entity_id=610823379&amp;context=see_photos"><i class="_3-8_ img sp_l43kx7Dp4qP sx_b2b580"></i>See Photos
</a>
</div>
</div>
</div>
<div>

<div class="_glm">
<div class="_pac" data-bt="{&quot;ct&quot;:&quot;sub_headers&quot;}">
<a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a>
<div class="_1my"></div></div></div><div class="_glo"></div>
</div>

<div class="_glp"></div>

<div class="_3t0c"></div></div></div></div></div></div>

For the first part, I tried soup.find_all('a'), but it did not return the href I wanted.

2 answers:

Answer 0 (score: 1)

This script gets the title and the current city. I also appended ?locale=en_US to the URL so that the English HTML page is returned rather than a localized one.

import requests
from bs4 import BeautifulSoup


url = 'https://www.facebook.com/jon.turner.7587?locale=en_US'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

title = soup.select_one('#fb-timeline-cover-name')
print(title.text if title else '-')

# Note: newer soupsieve releases deprecate :contains() in favor of :-soup-contains()
city = soup.select_one('div:contains("Current city"):not(:has(div))')
print(city.find_previous('span').text if city else '-')

Prints:

Jon Turner
Daphne, Alabama

Edit: For url="https://www.facebook.com/public/jon-turner?locale=en_US"

import requests
from bs4 import BeautifulSoup


url = 'https://www.facebook.com/public/jon-turner?locale=en_US'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for code in soup.select('code'):
    s = BeautifulSoup(code.contents[0], 'html.parser')
    for result in s.select('[data-testid="browse-result-content"]'):
        name = result.select_one('a > span').get_text(strip=True)
        place_work = result.select_one('[data-bt]').get_text(strip=True, separator=' ')
        print(name, place_work)

Prints:

Jon Turner 
Jon Turner 
Jon Turner 
Jonathan Turner 
Jon Turner Daphne, Alabama
Jon Turner Taylor, Michigan
Jon Turner Volunteer at Disability Allies East Brunswick Chapter
Jon Turner Owner at Turner Guitar Co.
Jon Turner Sales manager at Tim Short Chevrolet of South Williamson
Jon Turner 
Jon Turner electrician/ farm stuff at SEARS
Jon Turner Bradford High School
Jon Turner Cincinnati
Jon Turner 
Jon Turner 
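
The loop over `code` elements seems to rely on the search results being embedded as an HTML comment inside `<code>` tags, which is why a plain `soup.find_all('a')` on the outer document does not see them. Here is a minimal offline sketch of that idea. The `HTML` string is a made-up, heavily abridged stand-in for the real page, and `parse_hidden_results` is a hypothetical helper name, not something from the answer.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, abridged markup: the result list sits inside an HTML comment
# within a <code> element, so it is not part of the outer parsed tree.
HTML = """
<code><!-- <div data-testid="browse-result-content">
<a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587"><span>Jon Turner</span></a>
</div> --></code>
"""


def parse_hidden_results(html):
    """Return (name, href) pairs found inside commented-out <code> payloads."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for code in soup.select('code'):
        # Skip empty <code> elements and ones whose first child is not a comment.
        if not code.contents or not isinstance(code.contents[0], Comment):
            continue
        # Re-parse the comment text as its own HTML document.
        inner = BeautifulSoup(str(code.contents[0]), 'html.parser')
        for result in inner.select('[data-testid="browse-result-content"]'):
            link = result.select_one('a')
            if link is not None:
                pairs.append((link.get_text(strip=True), link.get('href')))
    return pairs


print(parse_hidden_results(HTML))
```

The `isinstance(..., Comment)` check is an extra guard that the answer's code does not have; it avoids re-parsing `<code>` elements that hold ordinary text.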

Edit: To extract href=, you can do the following:

import requests
from bs4 import BeautifulSoup


url = 'https://www.facebook.com/public/jon-turner?locale=en_US'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for code in soup.select('code'):
    s = BeautifulSoup(code.contents[0], 'html.parser')
    for result in s.select('[data-testid="browse-result-content"]'):
        name = result.select_one('a > span').get_text(strip=True)
        href = result.select_one('a')['href']
        place_work = result.select_one('[data-bt]').get_text(strip=True, separator=' ')
        print('{:<12} {:<60} {}'.format(name, place_work, href))

Prints:

Jon Turner                                                                https://www.facebook.com/jon.turner.359
Jon Turner                                                                https://www.facebook.com/people/Jon-Turner/100013646792198
Jon Turner   Operation Support Manager at Brammer Buck & Hickman          https://www.facebook.com/jon.turner.96930
Jon Turner                                                                https://www.facebook.com/jon.turner.14855377
Jon Turner   Daphne, Alabama                                              https://www.facebook.com/jon.turner.7587
Jon Turner                                                                https://www.facebook.com/jon.turner.904
Jon Turner                                                                https://www.facebook.com/people/Jon-Turner/100017624107252
Jon Turner   Owner at Turner Guitar Co.                                   https://www.facebook.com/jon.turner.92560
Jon Turner   Sales manager at Tim Short Chevrolet of South Williamson     https://www.facebook.com/jon.turner.5623
Jon Turner                                                                https://www.facebook.com/people/Jon-Turner/100017624107252
Jon Turner   Bradford High School                                         https://www.facebook.com/jon.turner.370
Jon Turner                                                                https://www.facebook.com/jon.turner.758399
Jon Turner   electrician/ farm stuff at SEARS                             https://www.facebook.com/jon.turner.79
Jon Turner   Owner-operator at JT Improvements                            https://www.facebook.com/jon.turner.923724
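
One thing to watch for: `select_one` returns `None` when nothing matches, so calling `.get_text()` or indexing `['href']` on the result raises an error for any entry that lacks that element. A small sketch of guarding against this is below. `text_or_default` and `href_or_default` are hypothetical helper names, and the fragment is an invented result with no `[data-bt]` sub-header.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, selector, default='-'):
    """Return the stripped text of the first match, or `default` if none."""
    node = parent.select_one(selector)
    if node is None:
        return default
    return node.get_text(strip=True, separator=' ')


def href_or_default(parent, selector, default='-'):
    """Return the href of the first match, or `default` if it is missing."""
    node = parent.select_one(selector)
    if node is None:
        return default
    return node.get('href', default)


# Hypothetical fragment: a result that has a name and link but no sub-header.
fragment = BeautifulSoup(
    '<div data-testid="browse-result-content">'
    '<a href="https://www.facebook.com/jon.turner.359"><span>Jon Turner</span></a>'
    '</div>',
    'html.parser',
)
result = fragment.select_one('[data-testid="browse-result-content"]')

name = text_or_default(result, 'a > span')
place_work = text_or_default(result, '[data-bt]')
href = href_or_default(result, 'a')
print('{:<12} {:<20} {}'.format(name, place_work, href))
```

With helpers like these, a missing sub-header shows up as `-` in the output instead of stopping the loop.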

Answer 1 (score: 0)

Although this may not be best practice, you can locate the parent div by its class name and then take the child you need from that parent.

This code assumes you have already made the request and stored the requests.get() result in a variable named req_get.

from bs4 import BeautifulSoup

# perform the HTTP request and get the resulting HTML
# assuming the response is in the req_get variable

soup = BeautifulSoup(req_get.content, 'html5lib')
# soup = BeautifulSoup(req_get.content, 'html.parser')

parent = soup.find('div', attrs={'class': '_52eh'})
title = parent.div.a['title']
href = parent.div.a['href']

parent = soup.find('div', attrs={'class': '_pac'})  # fixed stray quote in the class name
location = parent.a.text

print(title)
print(href)
print(location)

This should be enough to get started.
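
For comparison, the same lookup can be written with CSS selectors, which avoids chaining `.div.a` through the nesting. This is only a sketch run against an abridged copy of the HTML from the question. It assumes the class names `_32mo` and `_pac` are present in the markup you actually receive; they look auto-generated, so they may differ or change.

```python
from bs4 import BeautifulSoup

# Abridged copy of the markup from the question (most wrapper divs removed).
HTML = """
<div class="_3u1 _gli">
  <div class="_52eh _5bcu"><div>
    <a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587"><span>Jon Turner</span></a>
  </div></div>
  <div class="_glm"><div class="_pac">
    <a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a>
  </div></div>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Match the profile link by its class and the location link by its parent div.
profile = soup.select_one('a._32mo')
place = soup.select_one('div._pac > a')

# select_one returns None when nothing matches, so guard each lookup.
title = profile.get('title') if profile is not None else None
href = profile.get('href') if profile is not None else None
location = place.get_text(strip=True) if place is not None else None

print(title)
print(href)
print(location)
```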