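I am trying to extract the title and href from <a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587">,
and the text 'Daphne, Alabama' between <a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a>,
from the page source returned by requests.get. The relevant part of the source is: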
<div class="_3u1 _gli" data-bt="{"id":610823379,"rank":5,"abtest_version":null,"abtest_params":{"abtest_version":null,"origin":"A","ranker":null},"section":"main_column","owner_id":null,"sub_id":null,"browse_location":null,"query_data":[],"is_headline":false}" data-ft="{"tn":"-\\"}">
<div>
<div class="clearfix _ikh">
<div class="_4bl7 _3-90">
<a title="Jon Turner" class="_2ial" aria-label="Jon Turner" aria-hidden="true" tabindex="-1" role="presentation" href="https://www.facebook.com/jon.turner.7587">
<img class="_1glk _6phc img" src="https://scontent-sjc3-1.xx.fbcdn.net/v/t1.0-1/cp0/p74x74/104258995_10157724933523380_6784568501858187427_o.jpg?_nc_cat=111&_nc_sid=dbb9e7&_nc_oc=AQnwXFpW7dNBp-Tnx11O3pHh4-GhD8BxtkQ8tXJFYRmA1UdUET0O4-o8L_f5GOHfEjj5v9hEFvmf5nrX8M7gSibd&_nc_ht=scontent-sjc3-1.xx&oh=e01068610b697b26a26e78e7d6bfe728&oe=5F1938B5" width="72" height="72" alt="Jon Turner">
</a>
</div>
<div class="_4bl9">
<div data-testid="browse-result-content" class="_glj">
<div class="_5aj7">
<div class="_4bl9">
<div class="_gll">
<div class="_ajw">
<div style="-webkit-line-clamp: 2;" class="_52eh _5bcu">
<div>
<a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587">
<span>Jon Turner</span>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="_4bl7">
<div class="_glk">
<a role="button" class="_42ft _4jy0 _4jy3 _517h _51sy" href="https://www.facebook.com/jon.turner.7587/photos" rel="dialog" ajaxify="/ajax/timeline/sign_up_dialog/?next=https%3A%2F%2Fwww.facebook.com%2Fjon.turner.7587%2Fphotos&entity_id=610823379&context=see_photos"><i class="_3-8_ img sp_l43kx7Dp4qP sx_b2b580"></i>See Photos
</a>
</div>
</div>
</div>
<div>
<div class="_glm">
<div class="_pac" data-bt="{"ct":"sub_headers"}">
<a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a>
<div class="_1my"></div></div></div><div class="_glo"></div>
</div>
<div class="_glp"></div>
<div class="_3t0c"></div></div></div></div></div></div>
For the first part I tried soup.find_all('a'), but it did not return the href I wanted.
Answer 0 (score: 1)
This script gets the title and the current city. I also added ?locale=en_US to the URL so that the English HTML page is returned rather than a localized one.
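import requests
from bs4 import BeautifulSoup
url = 'https://www.facebook.com/jon.turner.7587?locale=en_US'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# profile name shown on the timeline cover
title = soup.select_one('#fb-timeline-cover-name')
print(title.text if title else '-')
# innermost div containing "Current city"; the city name is in the span before it
# (newer soupsieve releases deprecate :contains() in favor of :-soup-contains())
city = soup.select_one('div:contains("Current city"):not(:has(div))')
print(city.find_previous('span').text if city else '-')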
Prints:
Jon Turner
Daphne, Alabama
Edit: for url="https://www.facebook.com/public/jon-turner?locale=en_US":
import requests
from bs4 import BeautifulSoup
url = 'https://www.facebook.com/public/jon-turner?locale=en_US'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for code in soup.select('code'):
    s = BeautifulSoup(code.contents[0], 'html.parser')
    for result in s.select('[data-testid="browse-result-content"]'):
        name = result.select_one('a > span').get_text(strip=True)
        place_work = result.select_one('[data-bt]').get_text(strip=True, separator=' ')
        print(name, place_work)
Prints:
Jon Turner
Jon Turner
Jon Turner
Jonathan Turner
Jon Turner Daphne, Alabama
Jon Turner Taylor, Michigan
Jon Turner Volunteer at Disability Allies East Brunswick Chapter
Jon Turner Owner at Turner Guitar Co.
Jon Turner Sales manager at Tim Short Chevrolet of South Williamson
Jon Turner
Jon Turner electrician/ farm stuff at SEARS
Jon Turner Bradford High School
Jon Turner Cincinnati
Jon Turner
Jon Turner
Edit: to extract the href as well, you can do the following:
import requests
from bs4 import BeautifulSoup
url = 'https://www.facebook.com/public/jon-turner?locale=en_US'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for code in soup.select('code'):
    s = BeautifulSoup(code.contents[0], 'html.parser')
    for result in s.select('[data-testid="browse-result-content"]'):
        name = result.select_one('a > span').get_text(strip=True)
        href = result.select_one('a')['href']
        place_work = result.select_one('[data-bt]').get_text(strip=True, separator=' ')
        print('{:<12} {:<60} {}'.format(name, place_work, href))
Prints:
Jon Turner https://www.facebook.com/jon.turner.359
Jon Turner https://www.facebook.com/people/Jon-Turner/100013646792198
Jon Turner Operation Support Manager at Brammer Buck & Hickman https://www.facebook.com/jon.turner.96930
Jon Turner https://www.facebook.com/jon.turner.14855377
Jon Turner Daphne, Alabama https://www.facebook.com/jon.turner.7587
Jon Turner https://www.facebook.com/jon.turner.904
Jon Turner https://www.facebook.com/people/Jon-Turner/100017624107252
Jon Turner Owner at Turner Guitar Co. https://www.facebook.com/jon.turner.92560
Jon Turner Sales manager at Tim Short Chevrolet of South Williamson https://www.facebook.com/jon.turner.5623
Jon Turner https://www.facebook.com/people/Jon-Turner/100017624107252
Jon Turner Bradford High School https://www.facebook.com/jon.turner.370
Jon Turner https://www.facebook.com/jon.turner.758399
Jon Turner electrician/ farm stuff at SEARS https://www.facebook.com/jon.turner.79
Jon Turner Owner-operator at JT Improvements https://www.facebook.com/jon.turner.923724
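The loops above re-parse the first child of each <code> element, which suggests the search results are shipped as HTML text inside those elements rather than as ordinary nodes. The following is a minimal offline sketch of that two-stage parse, with guards for empty <code> elements and missing fields. The sample markup (one result wrapped in an HTML comment) and the helper name extract_results are assumptions made for illustration; the real page may be structured differently and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one search result hidden in a comment inside <code>,
# loosely modelled on the markup shown in the question.
SAMPLE = """
<html><body>
<code id="u_0_1"><!-- <div data-testid="browse-result-content">
  <a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587"><span>Jon Turner</span></a>
  <div class="_pac" data-bt="{}"><a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a></div>
</div> --></code>
<code id="u_0_2"></code>
</body></html>
"""


def extract_results(html):
    """Return (name, place_work, href) tuples found inside <code> payloads."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for code in soup.select('code'):
        if not code.contents:
            continue  # skip empty <code> elements instead of raising IndexError
        # The first child is a string (a comment in this sample); parse it again.
        inner = BeautifulSoup(code.contents[0], 'html.parser')
        for result in inner.select('[data-testid="browse-result-content"]'):
            span = result.select_one('a > span')
            link = result.select_one('a[href]')
            place = result.select_one('[data-bt]')
            rows.append((
                span.get_text(strip=True) if span else '-',
                place.get_text(strip=True, separator=' ') if place else '-',
                link['href'] if link else '-',
            ))
    return rows


for name, place_work, href in extract_results(SAMPLE):
    print('{:<12} {:<30} {}'.format(name, place_work, href))
```

Returning tuples instead of printing inside the loop keeps the parsing step separate from the output formatting, so the same rows can be filtered or written to a file.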
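Answer 1 (score: 0)
Although this may not be best practice, you can look up the parent div by its class name and then take the child you need from that parent div.
This code assumes you have already made the request and stored the result of requests.get() in a variable named req_get.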
from bs4 import BeautifulSoup
# perform the HTTP request and get the resulting HTML
# assuming the response is in the req_get variable
soup = BeautifulSoup(req_get.content, 'html5lib')
# soup = BeautifulSoup(req_get.content, 'html.parser')
# the name link sits inside the div with class _52eh
parent = soup.find('div', attrs={'class': '_52eh'})
title = parent.div.a['title']
href = parent.div.a['href']
# the location link sits inside the div with class _pac
parent = soup.find('div', attrs={'class': '_pac'})
location = parent.a.text
print(title)
print(href)
print(location)
This should be enough to get you started.
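One caveat with the parent-then-child lookup: soup.find returns None when a class is not present, so the attribute access fails on pages that lack the expected divs. Below is a minimal sketch with guards, run against a trimmed copy of the markup from the question. The helper name extract_profile and the '-' fallback value are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed excerpt of the markup from the question.
HTML = """
<div style="-webkit-line-clamp: 2;" class="_52eh _5bcu">
  <div>
    <a title="Jon Turner" class="_32mo" href="https://www.facebook.com/jon.turner.7587">
      <span>Jon Turner</span>
    </a>
  </div>
</div>
<div class="_pac">
  <a href="https://www.facebook.com/pages/Daphne-Alabama/104071176294827">Daphne, Alabama</a>
</div>
"""


def extract_profile(html):
    """Return title, href and location, using '-' for anything not found."""
    soup = BeautifulSoup(html, 'html.parser')
    # class_ matches any single class of an element, so '_52eh' also
    # matches class="_52eh _5bcu".
    name_box = soup.find('div', class_='_52eh')
    link = name_box.find('a') if name_box else None
    place_box = soup.find('div', class_='_pac')
    place_link = place_box.find('a') if place_box else None
    return {
        'title': link.get('title', '-') if link else '-',
        'href': link.get('href', '-') if link else '-',
        'location': place_link.get_text(strip=True) if place_link else '-',
    }


print(extract_profile(HTML))
```

Using .get() on the tag rather than indexing with ['title'] avoids a KeyError when the anchor exists but the attribute does not.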