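For someone with web programming and basic web scraping experience (which I don't have), I'm sure this is a simple question.
My goal is to collect information about the many tutors employed by Chegg by scraping their "bio" paragraphs. Although I'm new to web scraping, I imagine this will involve coding a scraper that recursively follows the tutor links and scrapes each tutor's bio.
Using Microsoft Edge DOM Explorer, I can find the tutors' link tags in the page's HTML.
However, when I use Python's requests module to fetch the page's HTML, the tutor links are not there! Oddly, other links on the page are detected, but none of the tutor links. The Python code is as follows: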
import requests
# requests needs a full URL including the scheme, otherwise it raises MissingSchema.
r = requests.get('https://www.chegg.com/tutors/online-tutors/')
print(r.content)
Can someone tell me what the problem is, and what I should learn (e.g. HTML programming, HTTP theory, etc.) so that I can tackle this project?
Answer 0 (score: 1)
All the data for each expert is in a div with the class expert-list-content:
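from bs4 import BeautifulSoup
import requests
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(requests.get("https://www.chegg.com/tutors/online-tutors/").content, "html.parser")
# Each tutor's data sits in its own div.expert-list-content.
for ex in soup.select("div.expert-list-content"):
    print(ex.select_one("div.expert-description").text)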
Which gives you:
"Tutoring gives me great pleasure because I not only get to feel good about helping others, but my students also gain..."
"I was a teaching assistant as a graduate student in mathematics, and taught several classes as a postdoc. I have been a tutor..."
"I have always been the go-to student for notes, essay proofreading, and math instruction. I have tutored at the Latino..."
"In my senior year of high school, I worked as a Physics Teaching Assistant and through that, I honed skills necessary to..."
"Throughout the past eight years, I have had the incredible opportunity to work closely with over 200 students in..."
"I have worked as a teaching assistant in my college for core disciplinary courses. I have also conducted training sessions on..."
"Scott here. Originally from Tennessee and educated in Cornell University, I've been tutoring/teaching math for 10 years and..."
"I am currently pursuing dual BE Mechanical Engineering and M.Sc Mathematics degrees from BITS Pilani. I have had ample..."
"I am a specialist in language and linguistics, with a particular interest in the history and grammar of the English language..."
"I graduated 7 years before and since then have taught many students on a regular basis in Finance and Mathematics. I have..."
To get the profile links and names:
for ex in soup.select("div.expert-list-content"):
    info = ex.select_one("div.expert-info a")
    print(info.text, info["href"])
Which gives you (printed as tuples because this was run under Python 2):
(u'Aleria S.', '/tutors/online-tutors/Aleria-S-371573/')
(u'Douglas Z.', '/tutors/online-tutors/Douglas-Z-568826/')
(u'Carla S.', '/tutors/online-tutors/Carla-S-864918/')
(u'Vinit R.', '/tutors/online-tutors/Vinit-R-2031766/')
(u'Anastasia G.', '/tutors/online-tutors/Anastasia-G-65278/')
(u'Vinay S.', '/tutors/online-tutors/Vinay-S-85533/')
(u'Gunjan G.', '/tutors/online-tutors/Gunjan-G-2695711/')
(u'Scott M.', '/tutors/online-tutors/Scott-M-277743/')
(u'Saumya U.', '/tutors/online-tutors/Saumya-U-890305/')
(u'Ed M.', '/tutors/online-tutors/Ed-M-2895636/')
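The hrefs above are relative to the site root, so they need to be resolved against the listing page before they can be requested. A minimal sketch of that step, written for Python 3 and using only the standard library; the helper name to_absolute is made up for illustration, and only two of the links printed above are included:

```python
from urllib.parse import urljoin

# The listing page the links were scraped from.
BASE_URL = "https://www.chegg.com/tutors/online-tutors/"

# Two of the relative hrefs printed above.
relative_links = [
    "/tutors/online-tutors/Aleria-S-371573/",
    "/tutors/online-tutors/Douglas-Z-568826/",
]


def to_absolute(base, hrefs):
    """Resolve each relative href against the page it was scraped from."""
    return [urljoin(base, href) for href in hrefs]


profile_urls = to_absolute(BASE_URL, relative_links)
for url in profile_urls:
    print(url)
```

Each resulting URL could then be passed to requests.get in the same way as the listing page. The markup of the individual profile pages is not shown here, so the selectors for the full bio would have to be found by inspecting one of those pages.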
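There is no Javascript involved; if you right-click in your browser and choose view source, you can see the links are right there. If they were created dynamically, you would see them only in a live DOM inspector such as Microsoft Edge DOM Explorer, not in the page source. In general, it is always a good idea to add a user agent: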
head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
soup = BeautifulSoup(requests.get("https://www.chegg.com/tutors/online-tutors/", headers=head).content, "html.parser")
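The view-source check described above can also be done from Python: if a class name you expect is present in the raw response text, the content came from the server and no Javascript rendering is needed. A rough sketch for Python 3; the function names looks_server_rendered and fetch_html are hypothetical, and the two sample pages are hand-written stand-ins, not real Chegg markup:

```python
def looks_server_rendered(html, marker="expert-list-content"):
    """Return True if `marker` appears in the raw HTML text."""
    return marker in html


def fetch_html(url, user_agent):
    """Fetch a page with requests, sending an explicit User-Agent."""
    import requests  # imported here so the helper above needs no third-party package

    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    return response.text


# Offline demonstration with two tiny hand-written documents (assumed markup).
static_page = '<div class="expert-list-content"><a href="/x/">A</a></div>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(looks_server_rendered(static_page))
print(looks_server_rendered(dynamic_page))
```

Against the live site, you would pass the result of fetch_html(url, user_agent) to looks_server_rendered. If the marker is missing from a response that the browser's view source does contain, the request headers are a reasonable first thing to compare.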