Very new here, so apologies in advance. I want to get a list of all the company descriptions from https://angel.co/companies. The web-based parsing tools I've tried haven't cut it, so I'm looking for a simple python script. Should I first get an array of all the company URLs and then loop through them? Any resources or direction would be helpful - I've gone through the BeautifulSoup documentation and a few posts/video tutorials, but I'm getting hung up on simulating the json request (see the related question here: Get all links with BeautifulSoup from a single page website ('Load More' feature))
Thanks!
Answer 0 (score: 5)
The data you want to scrape is loaded dynamically with ajax, so you need to do a bit of work to get the HTML you actually want:
import requests
from bs4 import BeautifulSoup

# Headers that make our requests look like the ajax calls the page itself makes.
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

with requests.Session() as s:
    # Initial GET to pick up a valid csrf-token from the page's meta tag.
    r = s.get("https://angel.co/companies").content
    csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
    header["X-CSRF-Token"] = csrf
    # POST to search_data returns json with the company ids and paging params.
    ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
    # ids%5B%5D is the url-encoded form of ids[], one entry per company id.
    _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
    rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
    url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
    rsp = s.get(url, headers=header)
    print(rsp.json())
We first need to get a valid csrf-token, which is what the initial request does; then we need to post to https://angel.co/company_filters/search_data, which gives us:
{"ids":[296769,297064,60,63,112,119,130,160,167,179,194,236,281,287,312,390,433,469,496,516],"total":908164,"page":1,"sort":"signal","new":false,"hexdigest":"3f4980479bd6dca37e485c80d415e848a57c43ae"}
Those are the parameters we need for our final request to https://angel.co/companies/startups.
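Assembling that query string by hand with format and join works, but the repeated ids[] keys can also be built with the standard library. A minimal sketch, assuming search_data stands in for the json dict shown above (build_startups_url is a hypothetical helper, not part of the answer's code):

```python
from urllib.parse import urlencode

def build_startups_url(search_data):
    """Turn the search_data json into the startups request URL."""
    ids = search_data.pop("ids")
    # Repeating the "ids[]" key once per id url-encodes to ids%5B%5D=...,
    # matching the manual string formatting used in the answer's code.
    query = urlencode([("ids[]", i) for i in ids] + sorted(search_data.items()))
    return "https://angel.co/companies/startups?" + query
```

urlencode handles the percent-escaping of the brackets for you, so there is one less place to get the encoding wrong.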
That request then gives us more json, whose html key contains all the company info:
{"html":"<div class=\" dc59 frs86 _a _jm\" data-_tn=\"companies/results ...........
There is far too much to post here, but that html is what you need to parse. So, putting it all together:
In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: header = {
   ...:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
   ...:     "X-Requested-With": "XMLHttpRequest",
   ...: }

In [4]: with requests.Session() as s:
   ...:     r = s.get("https://angel.co/companies").content
   ...:     csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
   ...:     header["X-CSRF-Token"] = csrf
   ...:     ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
   ...:     _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
   ...:     rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
   ...:     url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
   ...:     rsp = s.get(url, headers=header)
   ...:     soup = BeautifulSoup(rsp.json()["html"], "lxml")
   ...:     for comp in soup.select("div.base.startup"):
   ...:         text = comp.select_one("div.text")
   ...:         print(text.select_one("div.name").text.strip())
   ...:         print(text.select_one("div.pitch").text.strip())
   ...:
Frontback
Me, now.
Outbound
Optimizely for messages
Adaptly
The Easiest Way to Advertise Across The Social Web.
Draft
Words with Friends for Fantasy (w/ real money)
Graphicly
an automated ebook publishing and distribution platform
Appstores
App Distribution Platform
eVenues
Online Marketplace & Booking Engine for Unique Meeting Spaces
WePow
Video & Mobile Recruitment
DoubleDutch
Event Marketing Automation Software
ecomom
It's all good
BackType
Acquired by Twitter
Stipple
Native advertising for the visual web
Pinterest
A Universal Social Catalog
Socialize
Identify and reward your most influential users with our drop-in social platform.
StyleSeat
Largest and fastest growing marketplace in the $400B beauty and wellness industry
LawPivot
99 Designs for legal
Ostrovok
Leading hotel booking platform for Russian-speakers
Thumb
Leading mobile social network that helps people get instant opinions
AppFog
Making developing applications on the cloud easier than ever before
Artsy
Making all the world’s art accessible to anyone with an Internet connection.
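Since the original goal was a list of company descriptions, the name/pitch pairs printed above can be collected and written out with the stdlib csv module. A small sketch (save_companies is a hypothetical helper, not part of the answer's code):

```python
import csv

def save_companies(rows, path):
    """Write (name, pitch) pairs to a csv file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "pitch"])
        writer.writerows(rows)
```

In the loop above you would append (name, pitch) tuples to a list instead of printing, then pass that list here once all pages are done.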
As far as paging goes, you are limited to 20 pages a day, but to get all 20 you just need to add page: page_no to our form data; you can see the new parameter that gets posted when you click Load More. So the form data for each request becomes:
data={"sort": "signal","page":page}
Obviously what you parse out of it is up to you, but everything you can see in your browser's results will be available.
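The paging note above can be sketched as a simple loop. fetch_page below is a hypothetical stand-in for the session POST/GET logic from the answer, and the 20-page cap mirrors the daily limit mentioned:

```python
def iter_pages(fetch_page, max_pages=20):
    """Yield each page's json result, stopping early when a page has no ids."""
    for page in range(1, max_pages + 1):
        result = fetch_page({"sort": "signal", "page": page})
        if not result.get("ids"):
            break
        yield result
```

Keeping the network call behind a function argument like this also makes the loop easy to test without hitting the site.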