Question

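I'm sure this is an easy question for anyone with web programming and basic web scraping experience (which I don't have).

My goal is to collect information about many of the tutors Chegg employs by scraping their "bio" paragraphs. I'm new to web scraping, but I imagine this involves writing a scraper that recursively follows the tutor links: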

List of Tutors

and scrapes each tutor's bio.

Using Microsoft Edge DOM Explorer, I can find the tutor's link tag in the page's HTML:

Tutor's HTML link tag

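However, when I use Python's "requests" module to fetch the page's HTML, the tutors' links aren't there! Strangely, other links on the page are detected, but none of the tutor links. The Python code is as follows: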

import requests

# The URL needs an explicit scheme; without one, requests raises MissingSchema.
r = requests.get('https://www.chegg.com/tutors/online-tutors/')

print(r.content)

Can someone explain what is going on here, and tell me what I should study (e.g. HTML programming, HTTP theory, etc.) so that I can handle this project?

Answer 1

All of the data for each expert is inside a div with the class expert-list-content:

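from bs4 import BeautifulSoup
import requests

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(requests.get("https://www.chegg.com/tutors/online-tutors/").content, "html.parser")
# Each tutor's data sits in a div with the class expert-list-content.
for ex in soup.select("div.expert-list-content"):
    print(ex.select_one("div.expert-description").text)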

This gives you:

"Tutoring gives me great pleasure because I not only get to feel good about helping others, but my students also gain..."
"I was a teaching assistant as a graduate student in mathematics, and taught several classes as a postdoc. I have been a tutor..."
"I have always been the go-to student for notes, essay proofreading, and math instruction. I have tutored at the Latino..."
"In my senior year of high school, I worked as a Physics Teaching Assistant and through that, I honed skills necessary to..."
"Throughout the past eight years, I have had the incredible opportunity to work closely with over 200 students in..."
"I have worked as a teaching assistant in my college for core disciplinary courses. I have also conducted training sessions on..."
"Scott here. Originally from Tennessee and educated in Cornell University, I've been tutoring/teaching math for 10 years and..."
"I am currently pursuing dual BE Mechanical Engineering and M.Sc Mathematics degrees from BITS Pilani. I have had ample..."
"I am a specialist in language and linguistics, with a particular interest in the history and grammar of the English language..."
"I graduated 7 years before and since then have taught many students on a regular basis in Finance and Mathematics. I have..."

To get the profile links and names:

for ex in soup.select("div.expert-list-content"):
    # The anchor inside div.expert-info holds both the name and the profile path.
    info = ex.select_one("div.expert-info a")
    print(info.text, info["href"])

This gives you:

(u'Aleria S.', '/tutors/online-tutors/Aleria-S-371573/')
(u'Douglas Z.', '/tutors/online-tutors/Douglas-Z-568826/')
(u'Carla S.', '/tutors/online-tutors/Carla-S-864918/')
(u'Vinit R.', '/tutors/online-tutors/Vinit-R-2031766/')
(u'Anastasia G.', '/tutors/online-tutors/Anastasia-G-65278/')
(u'Vinay S.', '/tutors/online-tutors/Vinay-S-85533/')
(u'Gunjan G.', '/tutors/online-tutors/Gunjan-G-2695711/')
(u'Scott M.', '/tutors/online-tutors/Scott-M-277743/')
(u'Saumya U.', '/tutors/online-tutors/Saumya-U-890305/')
(u'Ed M.', '/tutors/online-tutors/Ed-M-2895636/')
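
The href values above are relative to the site root, so they need to be resolved against the site's address before each profile page can be requested. Below is a minimal sketch of that step, assuming Python 3 and assuming the listing page URL is a suitable base; `absolute_profile_urls` is a hypothetical helper name, not something from the original answer:

```python
from urllib.parse import urljoin

# Assumption: the listing page URL is used as the base for resolving links.
BASE_URL = "https://www.chegg.com/tutors/online-tutors/"


def absolute_profile_urls(hrefs, base_url=BASE_URL):
    """Resolve relative profile paths against the listing page URL."""
    return [urljoin(base_url, href) for href in hrefs]


# Two of the relative paths printed above, used as sample input.
sample_hrefs = [
    "/tutors/online-tutors/Aleria-S-371573/",
    "/tutors/online-tutors/Douglas-Z-568826/",
]
profile_urls = absolute_profile_urls(sample_hrefs)
for url in profile_urls:
    print(url)
```

Each resulting URL could then be fetched the same way as the listing page to read that tutor's full bio.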

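There is no JavaScript involved; if you right-click in your browser and choose View Source, you can see the links are right there in the source. If they were created dynamically, you would not see them in the source outside of Microsoft Edge DOM Explorer. In general, it is always a good idea to add a user agent.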

# Send a browser-like User-Agent header with the request.
head = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
soup = BeautifulSoup(requests.get("https://www.chegg.com/tutors/online-tutors/", headers=head).content, "html.parser")
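
The "view source" check described above can also be done in code: list the links present in the raw HTML the server returns and compare them with what DOM Explorer shows. The following is a rough sketch, assuming Python 3. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs, and `HrefCollector`, `static_hrefs`, and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def static_hrefs(html):
    """Return all link targets present in the HTML as served."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Made-up HTML standing in for a server response.
sample_html = """
<div class="expert-list-content">
  <div class="expert-info"><a href="/tutors/online-tutors/Example-T-1/">Example T.</a></div>
</div>
<script>/* links added here by a script would not be listed */</script>
"""
found = static_hrefs(sample_html)
print(found)
```

To apply it to a live page, pass in the text of a response, for example `requests.get(url, headers=head).text`. A link that appears in DOM Explorer but not in this list was most likely added by a script after the page loaded.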

Python "requests" module fails to detect some HTML link tags
