Cannot extract data from a BeautifulSoup object

Time: 2016-09-25 01:25:20

Tags: python beautifulsoup web-crawler

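I am scraping a website (https://www.zhihu.com/people/xie-ke-41/followers) and I want to get the information for all of the followers. As you can see, some of the follower information is loaded via AJAX, so I used the developer tools in Chrome and found the URL that returns the followers' information.

My code: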

import requests
from bs4 import BeautifulSoup


zhihu_rl = 'https://www.zhihu.com/node/ProfileFollowersListV2'

data = {
'method': 'next',
'params': '{"offset":20,"order_by":"created","hash_id":"86858a7a4aa77d290364625efcaacb70"}'}

headers = {
'Host': 'www.zhihu.com',
'Origin': 'https://www.zhihu.com',
'Referer': 'https://www.zhihu.com/people/xie-ke-41/followers',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'X-Xsrftoken': 'foo',
'Cookie':'xxxxxxxxxxxx'}

rep = requests.post(url=zhihu_rl, data=data, headers=headers)

bsobj = BeautifulSoup(rep.text,  'html.parser')

print(bsobj.find_all('div', {'class': "zm-profile-card zm-profile-section-item zg-clear no-hovercard"}))

It returns an empty list. I can see the information in the developer tools, so why can't bs4 extract it? PS: I can get all the divs, but the search fails as soon as I restrict it by attribute.

2 answers:

Answer 0: (score: 1)

The problem is that the HTML is escaped inside JSON. If you print bsobj you can see output like the following:

{"r":0,
 "msg": ["<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"6327483c9e474097e7dbb2493a7f277c\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u738b\u5728\u9014\"\ndata-hovercard=\"p$t$wang-zai-tu-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/wang-zai-tu-81\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$wang-zai-tu-81\" href=\"https:\/\/www.zhihu.com\/people\/wang-zai-tu-81\" class=\"zg-link author-link\" title=\"\u738b\u5728\u9014\"\n>\u738b\u5728\u9014<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/followers\" class=\"zg-link-gray-normal\">1 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/answers\" class=\"zg-link-gray-normal\">1 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"a3596eaecae6f05f0ddf95dfcc6b5517\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u7075\u9b42\"\ndata-hovercard=\"p$t$ling-hun-30-21\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/ling-hun-30-21\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div 
class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$ling-hun-30-21\" href=\"https:\/\/www.zhihu.com\/people\/ling-hun-30-21\" class=\"zg-link author-link\" title=\"\u7075\u9b42\"\n>\u7075\u9b42<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"74fad3af2b93f7da69c37eda64c31037\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u5f90\u6668\"\ndata-hovercard=\"p$t$xu-chen-77-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/xu-chen-77-49\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$xu-chen-77-49\" href=\"https:\/\/www.zhihu.com\/people\/xu-chen-77-49\" class=\"zg-link author-link\" title=\"\u5f90\u6668\"\n>\u5f90\u6668<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\">\u4f1a\u8ba1\u5e08<\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a 
target=\"_blank\" href=\"\/people\/xu-chen-77-49\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"032b36abfbe05a30913c794a4b099629\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"Shuai Zhang\"\ndata-hovercard=\"p$t$shuai-zhang-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/shuai-zhang-49\">\n<img src=\"https:\/\/pic2.zhimg.com\/v2-8aa42ff00873460e29444d62ff51acfd_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$shuai-zhang-49\" href=\"https:\/\/www.zhihu.com\/people\/shuai-zhang-49\" class=\"zg-link author-link\" title=\"Shuai Zhang\"\n>Shuai Zhang<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/followers\" class=\"zg-link-gray-normal\">79 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/asks\" class=\"zg-link-gray-normal\">1 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/answers\" class=\"zg-link-gray-normal\">119 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\" class=\"zg-link-gray-normal\">174 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" 
data-id=\"6388162f5357ca1bd872dc0b6efe4802\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u5468\u5468\"\ndata-hovercard=\"p$t$zhou-zhou-69-22\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/zhou-zhou-69-22\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$zhou-zhou-69-22\" href=\"https:\/\/www.zhihu.com\/people\/zhou-zhou-69-22\" class=\"zg-link author-link\" title=\"\u5468\u5468\"\n>\u5468\u5468<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/followers\" class=\"zg-link-gray-normal\">4 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/answers\" class=\"zg-link-gray-normal\">7 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\" class=\"zg-link-gray-normal\">1 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"3a1a9da0e0bb4abe2554fa2a6032f27f\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"\u7f8e\u7f8e\u836f\u5242\u5e08\"\ndata-hovercard=\"p$t$sui-nuo-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/sui-nuo-81\">\n<img src=\"https:\/\/pic2.zhimg.com\/ae23b8e89725a24de650dee53e9a60a5_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$sui-nuo-81\" 
href=\"https:\/\/www.zhihu.com\/people\/sui-nuo-81\" class=\"zg-link 

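Unfortunately it is also not valid JSON, so we cannot call rep.json() and get nicely unescaped HTML. You have to unescape it manually with string_escape (a codec that exists only in Python 2):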

In [14]: rep = requests.post(url=zhihu_rl, data=data, headers=headers)

In [15]: bsobj = BeautifulSoup(rep.text.decode("string_escape"),  'lxml')

In [16]: ancs = (bsobj.find_all('div', {'class': 'zm-profile-card zm-profile-section-item zg-clear no-hovercard'}))

In [17]: len(ancs)
Out[17]: 20

The class is also zm-profile-section-item, not zm-profile-section- item (with a space).
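Since string_escape is not available in Python 3, the same idea can be sketched as follows. This is only an illustration, not the answerer's code: extract_html is a hypothetical helper, and the sample payload is a shortened stand-in for the real response. It tries json.loads first, in case the body does parse as JSON, and otherwise falls back to undoing the backslash escapes by hand.

```python
import json


def extract_html(payload_text):
    """Return unescaped HTML from a response shaped like {"r": 0, "msg": [...]}.

    Hypothetical helper: assumes the HTML fragments live in the "msg" list.
    """
    try:
        payload = json.loads(payload_text)
    except ValueError:
        # Not valid JSON: undo the escapes manually (Python 3 stand-in
        # for Python 2's "string_escape"). The JSON wrapper stays in the
        # returned text, which an HTML parser will simply treat as text.
        text = payload_text.replace("\\/", "/")
        return text.encode("ascii", "backslashreplace").decode("unicode_escape")
    # Valid JSON: the decoder has already unescaped every fragment.
    return "".join(payload["msg"])


# Shortened, made-up payload in the same escaped style as the output above.
sample = '{"r":0,"msg":["<div class=\\"card\\">\\u738b<\\/div>"]}'
html = extract_html(sample)
```

The returned string can then be passed to BeautifulSoup(html, 'html.parser') and searched with find_all as in the question.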

Also, never post your login cookies in future; I could have had full access to your account within minutes.
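One way to keep a cookie out of posted code is to read it from the environment. The sketch below is an assumption-laden illustration: build_headers and the ZHIHU_COOKIE variable name are hypothetical, and only a few of the headers from the question are repeated.

```python
import os


def build_headers(environ=os.environ):
    """Build request headers, taking the cookie from an environment variable.

    ZHIHU_COOKIE is a made-up variable name used only for this sketch.
    """
    cookie = environ.get("ZHIHU_COOKIE")
    if not cookie:
        raise RuntimeError("set ZHIHU_COOKIE before running the scraper")
    return {
        "Referer": "https://www.zhihu.com/people/xie-ke-41/followers",
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": cookie,
    }
```

The script can then be shared as-is, because the secret never appears in the source file.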

Answer 1: (score: -2)

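You are using a good combination of headers; without them the server might not recognize the request and might assume that JavaScript is not enabled. When restricting by attribute, use . for classes and # for ids; other CSS selectors work as well. You also need Selenium for JavaScript execution (AJAX calls), because BeautifulSoup lacks this capability. Finally, make sure the site has no anti-scraping protection; if it does, you will need a JavaScript runtime such as Js2Py.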
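The . and # selector syntax mentioned above can be tried with BeautifulSoup's select method. The snippet below is a minimal sketch on made-up markup: the followers id and the example hrefs are assumptions for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the class names from the question.
html = """
<div id="followers">
  <div class="zm-profile-card zm-profile-section-item zg-clear no-hovercard">
    <a class="zg-link author-link" href="/people/example-user">Example User</a>
  </div>
  <div class="zm-profile-card zm-profile-section-item zg-clear no-hovercard">
    <a class="zg-link author-link" href="/people/another-user">Another User</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "." matches one class, however many other classes the tag carries.
cards = soup.select("div.zm-profile-card")

# "#" matches an id; a space selects descendants of the matched element.
links = [a["href"] for a in soup.select("#followers a.author-link")]
```

Matching on a single class this way avoids having to reproduce the full class string exactly, which is what find_all with a multi-class string requires.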