BeautifulSoup: trying to get data out of a <code>/&lt;!-- comment</code> tag

Asked: 2014-12-18 21:07:26

Tags: python facebook web-scraping beautifulsoup mechanize

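I'm trying to get Facebook users' profile pictures from their profile pages (when the pictures are available on their public profiles).

I'm having a hard time getting at the picture with BeautifulSoup.

At the moment I'm using the following code to locate the picture link: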

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt

page_open = br.open("https://www.facebook.com/zuck")
soup = BeautifulSoup(page_open.read(), "html.parser")  # parse the fetched page

x = soup.find(id="u_0_6")  # the id sometimes changes to "u_0_5"
strx = str(x)
strx[2469:2690]  # really bad choice: hard-coded character offsets

With that last line I can extract the URL only if the preceding markup never changes, and that is never the case. How can I get this value:

"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/c14.4.153.153/1939620_10101266232851011_437577509_n.jpg?oh=014037065b8baa2346444c66b16ddc25&oe=5547259F&__gda__=1429976114_ffd73e14776a391219e64a1ce6a4d1fb"

which is located in this part of the HTML:

<code class="hidden_elem" id="u_0_6"><!-- <div class="timelineLoggedOutSignUp"><div class="_5h60" id="pagelet_loggedout_sign_up" data-referrer="pagelet_loggedout_sign_up"></div></div><div class="fbTimelineTopSectionBase _6-d _529n"><div class="_5h60" id="pagelet_above_header_timeline" data-referrer="pagelet_above_header_timeline"></div><div id="above_header_timeline_placeholder"></div><div class="fbTimelineSection mtm fbTimelineTopSection fbTimelineLoggedOutTopSection"><div id="fbProfileCover"><div class="cover" id="u_0_2"><a class="coverWrap coverImage" href="https://www.facebook.com/photo.php?fbid=10101026493146301&amp;set=a.941146602501.2418915.4&amp;type=1" rel="theater" ajaxify="https://www.facebook.com/photo.php?fbid=10101026493146301&amp;set=a.941146602501.2418915.4&amp;type=1&amp;src=https%3A%2F%2Ffbcdn-sphotos-a-a.akamaihd.net%2Fhphotos-ak-frc3%2Ft31.0-8%2F1275272_10101026493146301_791186452_o.jpg&amp;smallsrc=https%3A%2F%2Ffbcdn-sphotos-a-a.akamaihd.net%2Fhphotos-ak-xap1%2Fv%2Ft1.0-9%2F1186268_10101026493146301_791186452_n.jpg%3Foh%3Dfc0981d4a65c2e984cf5c43fdc1bcc88%26oe%3D55072936%26__gda__%3D1430325870_8783e46096a8a5456fc0e745fb89f303&amp;size=1434%2C717&amp;source=10&amp;player_origin=profile" title="Photo de couverture" id="fbCoverImageContainer" data-cropped="1"><img class="coverPhotoImg photo img" src="https://fbcdn-sphotos-a-a.akamaihd.net/hphotos-ak-frc3/t31.0-8/q83/c0.93.1434.531/s851x315/1275272_10101026493146301_791186452_o.jpg" style="top:0px;width:100%" data-fbid="10101026493146301" alt="Photo de couverture" /><div class="coverBorder"></div><img class="coverChangeThrobber img" src="https://fbstatic-a.akamaihd.net/rsrc.php/v2/yk/r/LOOn0JtHNzb.gif" alt="" width="16" height="16" /></a></div><div id="fbTimelineHeadline" class="clearfix"><div class="actions"><div class="_5h60 actionsDropdown" id="pagelet_timeline_profile_actions" data-referrer="pagelet_timeline_profile_actions"></div></div><div class="name"><div class="photoContainer"><div><div 
class="profilePicThumb"><img class="profilePic img" alt="Mark Zuckerberg" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/c14.4.153.153/1939620_10101266232851011_437577509_n.jpg?oh=014037065b8baa2346444c66b16ddc25&amp;oe=5547259F&amp;__gda__=1429976114_ffd73e14776a391219e64a1ce6a4d1fb" /></div></div><meta itemprop="image" content="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/c14.4.153.153/s50x50/1939620_10101266232851011_437577509_n.jpg?oh=6b6cd8460210e1de160cf8a6056df416&amp;oe=550D5F6C&amp;__gda__=1429858477_b29a956770b6173d71cb28eb35fa99e6" /></div><h2 itemprop="name">Mark Zuckerberg<span data-hover="tooltip" data-tooltip-position="right" class="_56_f _5dzy _5d-1" id="u_0_4"></span></h2></div></div></div></div></div><div class="timelineLoggedOutPagelet"><div class="clearfix"><div class="timelineLoggedOutMain lfloat _ohe"><div class="_5h60 allFavorites" id="pagelet_all_favorites" data-referrer="pagelet_all_favorites"></div></div><div class="timelineLoggedOutRight rfloat _ohf"><div class="fbTimelineSection mtm fbTimelineCompactSection"><div class="_5h60" id="pagelet_search" data-referrer="pagelet_search"></div></div><div class="_5h60" id="pagelet_people_same_name" data-referrer="pagelet_people_same_name"></div><div class="_5h60" id="pagelet_contact" data-referrer="pagelet_contact"></div></div></div></div> --></code>

2 answers:

Answer 0 (score: 1)

Alternatively, instead of scraping Facebook, you can do this the right way through their Graph API ;)

import requests

url = "http://graph.facebook.com/{}".format("zuck")
params = { "fields": "picture" }
response = requests.get(url, params=params).json()

picture_url = response['picture']['data']['url']
print(picture_url)

# output:
# https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xfa1/v/t1.0-1/c14.4.153.153/s50x50/1939620_10101266232851011_437577509_n.jpg?oh=6b6cd8460210e1de160cf8a6056df416&oe=550D5F6C&__gda__=1429858477_b29a956770b6173d71cb28eb35fa99e6

Explanation: the profile picture URL is a public field, so you can access it without authenticating.

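Advantages:

  1. You don't even need BeautifulSoup
  2. The response is faster
  3. You do it the developer's way (rather than with a dirty hack)
  4. You use requests rather than urllib
  5. It's more practical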

    Play with the Facebook Graph API: https://developers.facebook.com/tools/explorer
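The lookup response['picture']['data']['url'] in the snippet above raises a KeyError whenever the API returns something other than the expected shape, for example an error object. A minimal sketch of a more defensive lookup follows; the helper name extract_picture_url and both sample payloads are hypothetical and only mirror the nesting used in this answer, they are not real API output.

```python
def extract_picture_url(payload):
    """Return payload['picture']['data']['url'], or None if any level is missing."""
    node = payload
    for key in ("picture", "data", "url"):
        # Stop as soon as the expected nesting breaks.
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical payloads for illustration; the URL is a placeholder.
ok_payload = {"picture": {"data": {"url": "https://example.com/pic.jpg"}}, "id": "4"}
error_payload = {"error": {"message": "An access token is required"}}

print(extract_picture_url(ok_payload))     # https://example.com/pic.jpg
print(extract_picture_url(error_payload))  # None
```

In the original snippet this would be called as extract_picture_url(response), with a None result treated as "no public picture available".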

Answer 1 (score: 0)

I'm not sure how reliable this is, because <code><!-- <div... looks odd to me and I know very little about HTML, but this code should work:

element = soup.find(id="u_0_6")
# element.string is the HTML comment; parse its text as a second document
inner = BeautifulSoup(element.string, "html.parser")
image = inner.find('img', attrs={'class': ['profilePic']})
print(image)
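Since the question notes that the id flips between "u_0_6" and "u_0_5", one possible variation is to scan every HTML comment on the page instead of relying on a fixed id. The sketch below assumes a recent bs4 release (the string= argument and the Comment class); SAMPLE_HTML is a trimmed, made-up stand-in for the real page and find_profile_pic_url is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, Comment

# Trimmed stand-in for the page in the question (placeholder URL, illustration only).
SAMPLE_HTML = """
<html><body>
<code class="hidden_elem" id="u_0_5"><!-- <div class="other"></div> --></code>
<code class="hidden_elem" id="u_0_6"><!-- <div class="profilePicThumb"><img class="profilePic img" alt="Mark Zuckerberg" src="https://example.com/pic.jpg?oh=1&amp;oe=2" /></div> --></code>
</body></html>
"""


def find_profile_pic_url(html):
    """Return the src of the first img.profilePic found inside any HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Visit every comment node rather than looking up an id such as "u_0_6".
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # The comment body is itself markup, so parse it as a second document.
        inner = BeautifulSoup(comment, "html.parser")
        img = inner.find("img", class_="profilePic")
        if img is not None and img.has_attr("src"):
            return img["src"]
    return None


print(find_profile_pic_url(SAMPLE_HTML))
```

Reading the src attribute through BeautifulSoup also decodes the &amp;amp; entities, so the result is the plain URL the question asks for rather than the escaped form that string slicing would return.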