我正在抓取一个网站(https://www.zhihu.com/people/xie-ke-41/followers),我希望得到所有关注者的信息。正如您所看到的,一些关注者的信息带有AJAX,我使用chrome中的开发人员工具并找到网址the url which has followers' information
我的代码:
import requests
from bs4 import BeautifulSoup
zhihu_rl = 'https://www.zhihu.com/node/ProfileFollowersListV2'
data = {
'method': 'next',
'params': '{"offset":20,"order_by":"created","hash_id":"86858a7a4aa77d290364625efcaacb70"}'}
headers = {
'Host': 'www.zhihu.com',
'Origin': 'https://www.zhihu.com',
'Referer': 'https://www.zhihu.com/people/xie-ke-41/followers',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'X-Xsrftoken': 'foo',
'Cookie':'xxxxxxxxxxxx'}
rep = requests.post(url=zhihu_rl, data=data, headers=headers)
bsobj = BeautifulSoup(rep.text, 'html.parser')
print(bsobj.find_all('div', {'class': "zm-profile-card zm-profile-section-item zg-clear no-hovercard"}))
并返回一个空列表。 我可以看到信息是开发人员的工具: ,为什么bs4不能提取它们? PS:我可以获得所有div,但是当我限制属性时。失败
答案 0 :(得分:1)
问题是你已经逃过了json,如果你打印bsobj就可以看到输出如下:
{"r":0,
"msg": ["<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"6327483c9e474097e7dbb2493a7f277c\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u738b\u5728\u9014\"\ndata-hovercard=\"p$t$wang-zai-tu-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/wang-zai-tu-81\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$wang-zai-tu-81\" href=\"https:\/\/www.zhihu.com\/people\/wang-zai-tu-81\" class=\"zg-link author-link\" title=\"\u738b\u5728\u9014\"\n>\u738b\u5728\u9014<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/followers\" class=\"zg-link-gray-normal\">1 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\/answers\" class=\"zg-link-gray-normal\">1 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/wang-zai-tu-81\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"a3596eaecae6f05f0ddf95dfcc6b5517\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u7075\u9b42\"\ndata-hovercard=\"p$t$ling-hun-30-21\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/ling-hun-30-21\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$ling-hun-30-21\" href=\"https:\/\/www.zhihu.com\/people\/ling-hun-30-21\" class=\"zg-link author-link\" title=\"\u7075\u9b42\"\n>\u7075\u9b42<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/ling-hun-30-21\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"74fad3af2b93f7da69c37eda64c31037\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8<\/button>\n<\/div>\n<a title=\"\u5f90\u6668\"\ndata-hovercard=\"p$t$xu-chen-77-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/xu-chen-77-49\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$xu-chen-77-49\" href=\"https:\/\/www.zhihu.com\/people\/xu-chen-77-49\" class=\"zg-link author-link\" title=\"\u5f90\u6668\"\n>\u5f90\u6668<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\">\u4f1a\u8ba1\u5e08<\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/followers\" class=\"zg-link-gray-normal\">0 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\/answers\" class=\"zg-link-gray-normal\">0 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/xu-chen-77-49\" class=\"zg-link-gray-normal\">0 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"032b36abfbe05a30913c794a4b099629\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"Shuai Zhang\"\ndata-hovercard=\"p$t$shuai-zhang-49\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/shuai-zhang-49\">\n<img src=\"https:\/\/pic2.zhimg.com\/v2-8aa42ff00873460e29444d62ff51acfd_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$shuai-zhang-49\" href=\"https:\/\/www.zhihu.com\/people\/shuai-zhang-49\" class=\"zg-link author-link\" title=\"Shuai Zhang\"\n>Shuai Zhang<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/followers\" class=\"zg-link-gray-normal\">79 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/asks\" class=\"zg-link-gray-normal\">1 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\/answers\" class=\"zg-link-gray-normal\">119 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/shuai-zhang-49\" class=\"zg-link-gray-normal\">174 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"6388162f5357ca1bd872dc0b6efe4802\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u4ed6<\/button>\n<\/div>\n<a title=\"\u5468\u5468\"\ndata-hovercard=\"p$t$zhou-zhou-69-22\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/zhou-zhou-69-22\">\n<img src=\"https:\/\/pic1.zhimg.com\/da8e974dc_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$zhou-zhou-69-22\" href=\"https:\/\/www.zhihu.com\/people\/zhou-zhou-69-22\" class=\"zg-link author-link\" title=\"\u5468\u5468\"\n>\u5468\u5468<\/a><\/span><\/h2>\n\n<div class=\"summary-wrapper summary-wrapper--medium\">\n\n<span class=\"bio\"><\/span>\n<\/div>\n<div class=\"details zg-gray\">\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/followers\" class=\"zg-link-gray-normal\">4 \u5173\u6ce8\u8005<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/asks\" class=\"zg-link-gray-normal\">0 \u63d0\u95ee<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\/answers\" class=\"zg-link-gray-normal\">7 \u56de\u7b54<\/a>\n\/\n<a target=\"_blank\" href=\"\/people\/zhou-zhou-69-22\" class=\"zg-link-gray-normal\">1 \u8d5e\u540c<\/a>\n<\/div>\n\n<\/div>\n<\/div>","<div class=\"zm-profile-card zm-profile-section-item zg-clear no-hovercard\">\n<div class=\"zg-right\">\n<button data-follow=\"m:button\" data-id=\"3a1a9da0e0bb4abe2554fa2a6032f27f\" class=\"zg-btn zg-btn-follow zm-rich-follow-btn small nth-0\">\u5173\u6ce8\u5979<\/button>\n<\/div>\n<a title=\"\u7f8e\u7f8e\u836f\u5242\u5e08\"\ndata-hovercard=\"p$t$sui-nuo-81\"\nclass=\"zm-item-link-avatar\"\nhref=\"\/people\/sui-nuo-81\">\n<img src=\"https:\/\/pic2.zhimg.com\/ae23b8e89725a24de650dee53e9a60a5_m.jpg\" class=\"zm-item-img-avatar\">\n<\/a>\n<div class=\"zm-list-content-medium\">\n<h2 class=\"zm-list-content-title\"><span class=\"author-link-line\">\n<a data-hovercard=\"p$t$sui-nuo-81\" href=\"https:\/\/www.zhihu.com\/people\/sui-nuo-81\" class=\"zg-link
不幸的是它也是无效的 json 所以我们无法调用req.json()
并获得好的未转义的html,因此您必须使用 string_escape 手动执行此操作:
In [14]: rep = requests.post(url=zhihu_rl, data=data, headers=headers)
In [15]: bsobj = BeautifulSoup(rep.text.decode("string_escape"), 'lxml')
In [16]: ancs = (bsobj.find_all('div', {'class': 'zm-profile-card zm-profile-section-item zg-clear no-hovercard'}))
In [17]: len(ancs)
Out[17]: 20
它也是 zm-profile-section-item
而不是 zm-profile-section- item
此外,将来永远不会发布登录cookie,我可以在几分钟内完全访问您的帐户。
答案 1 :(得分:-2)