I'm trying to use the Tumblr API, specifically PyTumblr, to scrape some images from posts that carry certain tags.
The code I'm using is very simple:
import pytumblr
from bs4 import BeautifulSoup
# Authenticate via API Key
client = pytumblr.TumblrRestClient('#Here is my API Key#')
print client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)
The result looks like this:
{
"meta": {
"status": 200,
"msg": "OK"
},
"response": {
"blog": {
"title": "W é r G i d A",
"name": "wergida",
"total_posts": 1181,
"posts": 1181,
"url": "http://wergida.tumblr.com/",
"updated": 1466319493,
"description": "Ha bárkit érdekelne",
"is_nsfw": false,
"ask": false,
"ask_page_title": "Ask me anything",
"ask_anon": false,
"share_likes": true,
"likes": 1131
},
"posts": [
{
"blog_name": "wergida",
"id": 136740690571,
"post_url": "http://wergida.tumblr.com/post/136740690571/bernhard-bernd-becher-1931-2007-and-hilla",
"slug": "bernhard-bernd-becher-1931-2007-and-hilla",
"type": "photo",
"date": "2016-01-06 11:30:23 GMT",
"timestamp": 1452079823,
"state": "published",
"format": "html",
"reblog_key": "TiOl8nWT",
"tags": [
"industrial facades",
"bernd and hilla becher",
"photography",
"eisenhüttenstadt",
"brandenburg"
],
"short_url": "https://tmblr.co/ZaE70t1-MOLgB",
"summary": "Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT...",
"recommended_source": null,
"recommended_color": null,
"highlighted": [],
"note_count": 2,
"caption": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br/></p>",
"reblog": {
"tree_html": "",
"comment": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>"
},
"trail": [
{
"blog": {
"name": "wergida",
"active": true,
"theme": {
"avatar_shape": "square",
"background_color": "#FAFAFA",
"body_font": "Helvetica Neue",
"header_bounds": "",
"header_image": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
"header_image_focused": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
"header_image_scaled": "https://secure.assets.tumblr.com/images/default_header/optica_pattern_05.png?_v=671444c5f47705cce40d8aefd23df3b1",
"header_stretch": true,
"link_color": "#529ECC",
"show_avatar": true,
"show_description": true,
"show_header_image": true,
"show_title": true,
"title_color": "#444444",
"title_font": "Gibson",
"title_font_weight": "bold"
},
"share_likes": true,
"share_following": false
},
"post": {
"id": "136740690571"
},
"content_raw": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br></p>",
"content": "<p>Bernhard ‘Bernd’ Becher (1931-2007) and Hilla Becher (1934-2015): Eisenhüttenstadt, Brandenburg. Industrial Facades, The MIT Press, 1995.<br /></p>",
"is_current_item": true,
"is_root_item": true
}
],
"image_permalink": "http://wergida.tumblr.com/image/136740690571",
"photos": [
{
"caption": "",
"alt_sizes": [
{
"url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
"width": 1280,
"height": 973
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_500.jpg",
"width": 500,
"height": 380
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_400.jpg",
"width": 400,
"height": 304
},
{
"url": "https://65.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_250.jpg",
"width": 250,
"height": 190
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_100.jpg",
"width": 100,
"height": 76
},
{
"url": "https://66.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_75sq.jpg",
"width": 75,
"height": 75
}
],
"original_size": {
"url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
"width": 1280,
"height": 973
}
}
]
}
],
"total_posts": 223
}
}
But when I use BeautifulSoup to parse the information I get back:
soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")
I get:
Traceback (most recent call last):
File "tumblr_test.py", line 29, in <module>
soup = BeautifulSoup(client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0),"lxml")
File "/Users/CB/Public/scrapy/env/lib/python2.7/site-packages/bs4/__init__.py", line 199, in __init__
if markup[:5] == "http:" or markup[:6] == "https:":
TypeError: unhashable type
I've tried different parsers, such as "html.parser" and "html5lib", and still get the same error.
Thanks for any pointers!
Answer 0 (score: 2)
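The client.posts() call returns a Python dictionary, not a string containing HTML; it has already parsed the JSON response for you. BeautifulSoup tries to treat that value as a string, so the markup[:5] expression shown in your traceback passes [:5] to the dictionary as a slice object, and a slice object is not hashable: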
>>> {}[:5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type
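One way to catch this kind of mix-up earlier is to check what type a value has before handing it to a parser. The following is only a minimal sketch: is_markup is a hypothetical helper name, and the JSON literal is a stand-in for the already-decoded API reply, not real Tumblr output.

```python
import json


def is_markup(value):
    """Return True only for text or bytes, the kinds of value an HTML parser expects."""
    return isinstance(value, (str, bytes))


# Stand-in for the decoded API reply (assumption: the real reply is a dict like this).
payload = json.loads('{"response": {"posts": []}}')

print(type(payload).__name__)        # dict
print(is_markup(payload))            # False
print(is_markup("<p>caption</p>"))   # True
```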
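A dictionary is not HTML. There is no need to try to parse it with BeautifulSoup. Just access the individual data elements in the nested structure; if such an element is itself a string, and that string contains HTML markup, then it may make sense to parse that particular piece of data: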
response = client.posts('wergida.tumblr.com', type='photo', tag='BERND AND HILLA BECHER', limit=1, offset=0)
post = response['response']['posts'][0]
soup = BeautifulSoup(post['caption'], "html.parser")  # name the parser explicitly; the caption is an HTML string
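Since the goal in the question is to collect images, the same nested-access idea can be applied to the photos list of each post. This is a rough sketch rather than a definitive implementation: collect_photo_urls is a hypothetical helper, and sample_response is a trimmed copy of the response shown in the question, used here in place of a live client.posts() call so the snippet runs without an API key.

```python
def collect_photo_urls(api_response):
    """Return the original-size URL of every photo in every post of the response."""
    urls = []
    for post in api_response.get("response", {}).get("posts", []):
        # Non-photo posts may have no "photos" key, so fall back to an empty list.
        for photo in post.get("photos", []):
            original = photo.get("original_size")
            if original and "url" in original:
                urls.append(original["url"])
    return urls


# Trimmed copy of the response from the question (assumption: a live reply has the same shape).
sample_response = {
    "meta": {"status": 200, "msg": "OK"},
    "response": {
        "posts": [
            {
                "id": 136740690571,
                "type": "photo",
                "photos": [
                    {
                        "caption": "",
                        "original_size": {
                            "url": "https://67.media.tumblr.com/ea41a17d0febfd019c7afae5fcc6c51e/tumblr_nzk87tVlqk1s5ljg4o1_1280.jpg",
                            "width": 1280,
                            "height": 973,
                        },
                    }
                ],
            }
        ]
    },
}

for url in collect_photo_urls(sample_response):
    print(url)
```

With a real client, the dictionary returned by client.posts(...) could be passed to the helper in place of sample_response; the smaller renditions are listed under each photo's alt_sizes key if the original size is not needed.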