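I am scraping data from a website. In the page source, the table data only shows up as "loading". How can I collect the data with Python? It appears to be a React.js web app.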
Answer 0 (score: 1)
I couldn't find it as a request under XHR, so you can use Selenium, which lets the page render, and then use pandas to scrape the table:
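from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd

# Selenium 4 style; on Selenium 3, pass the chromedriver path to webdriver.Chrome() directly
driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))
url = 'https://www.ycombinator.com/companies/'
driver.get(url)  # let the browser load and render the page
df = pd.read_html(driver.page_source)[0]  # first table found in the rendered HTML
driver.quit()  # end the browser session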
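Because the table is filled in by JavaScript, `page_source` can still contain the "loading" placeholder if it is read too early. Below is a minimal sketch of the same approach with an explicit wait. The function name `scrape_rendered_table`, the `timeout` default, and the `table tr` selector are assumptions for illustration, and the code assumes Selenium 4.6+ so that chromedriver is located automatically.

```python
from io import StringIO


def scrape_rendered_table(url, timeout=30):
    """Load `url` in Chrome, wait for a table row to appear, return the first table."""
    # Imported here so that defining the function needs no browser setup.
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Selenium 4.6+ finds chromedriver itself
    try:
        driver.get(url)
        # Assumption: the rendered data lives in a <table> with at least one <tr>.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
        )
        html = driver.page_source
    finally:
        driver.quit()  # always shut the browser down, even if the wait times out
    # StringIO avoids the deprecation warning newer pandas emits for raw HTML strings.
    return pd.read_html(StringIO(html))[0]


# Usage (opens a real browser window, so it is left commented out):
# df = scrape_rendered_table("https://www.ycombinator.com/companies/")
# print(df.head())
```

If the wait times out, `WebDriverWait.until` raises `TimeoutException`, which usually means the selector does not match the rendered page and needs adjusting.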
Output:
print(df)
[ 0 1 2
0 Actiondesk s2019 Google Sheets meets Zapier. Actiondesk lets no...
1 Alana s2019 Helping large companies in LATAM hire blue-col...
2 Apero Health s2019 Modern medical billing.
3 Apurata s2019 Small loans for the Latin American middle clas...
4 Arpeggio Bio s2019 Arpeggio builds technology to watch and learn ...
5 Asayer s2019 Asayer is a session replay tool for developers...
6 Asher Bio s2019 We build better immunotherapies
7 AudioFocus s2019 NaN
8 Axite Labs s2019 A modern IP licensing platform to accelerate t...
9 basis s2019 Software to automate construction workflows, s...
10 Beacons AI s2019 Helping creators monetize through short video ...
11 Binks s2019 Binks is a chain of trusted micro-boutiques th...
12 Blair s2019 Financing college education through Income Sha...
13 Boost Biomes s2019 NaN
14 Bouncer s2019 SDK for scanning and verifying credit cards an...
15 Brave Care s2019 Modern healthcare for kids. We do that with a ...
16 Breadfast s2019 Breadfast delivers fresh bread, milk and eggs ...
17 BuildStream s2019 A market network for industrial labor
18 Business Score s2019 Connecting startups with the things they need.
19 Canix s2019 Canix makes it easy to get and stay compliant ...
20 Carry s2019 Carry plans, books, and supports corporate tra...
21 Carve s2019 NaN
22 Cloosiv s2019 Cloosiv is an order-ahead app for independent ...
23 Coco s2019 The Venezuelan Instacart - allowing Venezuelan...
24 CoLab Software s2019 Jira for Mechanical Engineering Teams
25 Compound s2019 Compound helps people who work at startups und...
26 Courier s2019 Send your product's user notifications to the ...
27 Covela s2019 The digital insurance broker for SMEs in LATAM
28 Cuboh s2019 Cuboh helps restaurants use several delivery p...
29 Curri s2019 We provide on-demand material delivery for the...
... ... ...
2009 Zenter w2007 NaN
2010 Jamglue s2006 NaN
2011 Jumpchat s2006 NaN
2012 Likebetter s2006 NaN
2013 Omgpop s2006 NaN
2014 Pollground s2006 Online polls.
2015 Scribd s2006 World's largest online library.
2016 Shoutfit s2006 NaN
2017 Talkito s2006 NaN
2018 Thinkature s2006 NaN
2019 Xobni s2006 NaN
2020 Zanbazaar s2006 NaN
2021 Audiobeta w2006 NaN
2022 Clustrix w2006 NaN
2023 Flagr w2006 NaN
2024 Inkling w2006 NaN
2025 Project Wedding w2006 NaN
2026 Snipshot w2006 We sold Snipshot to Ansa in 2013.
2027 Wufoo w2006 Online form builder.
2028 Airtime s2005 NaN
2029 Clickfacts s2005 NaN
2030 Infogami s2005 NaN
2031 Kiko s2005 We're the best online calendar solution to eve...
2032 Loopt s2005 NaN
2033 Memamp s2005 NaN
2034 Parakey s2005 NaN
2035 Posthaven s2005 Blogging forever
2036 Reddit s2005 The frontpage of the internet.
2037 Simmery s2005 NaN
2038 TextPayMe s2005 NaN
[2039 rows x 3 columns]]
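The columns in the output above are unnamed (0, 1, 2). Here is a small sketch of labeling them and filtering by batch. The column names `name`, `batch`, and `description` are my own labels, not something the page provides, and the four rows are copied from the output above to stand in for the scraped frame.

```python
import pandas as pd

# Stand-in for the scraped frame: a few rows copied from the output above.
df = pd.DataFrame([
    ["Scribd", "s2006", "World's largest online library."],
    ["Wufoo", "w2006", "Online form builder."],
    ["Reddit", "s2005", "The frontpage of the internet."],
    ["Loopt", "s2005", None],
])

# Assumed labels for the three positional columns.
df.columns = ["name", "batch", "description"]

# Keep only the companies from one batch.
s2005 = df[df["batch"] == "s2005"]
print(s2005["name"].tolist())
```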
Answer 1 (score: 1)
If you go to the Network tab, you will find the API below, which returns the data in JSON format. You don't need selenium or beautifulsoup.
Here is the code.
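import requests

# The endpoint returns a JSON array with one object per company
res = requests.get("https://api.ycombinator.com/companies/export.json?").json()
for item in res:
    # A missing or null field skips the rest of that company's record
    try:
        print('name:' + item['name'])
    except (KeyError, TypeError):
        continue
    try:
        print('URL:' + item['url'])
    except (KeyError, TypeError):
        continue
    try:
        print('batch:' + item['batch'])
    except (KeyError, TypeError):
        continue
    try:
        print('Description:' + item['description'])
    except (KeyError, TypeError):
        continue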
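One thing to watch in the loop above: `continue` moves on to the next company as soon as one field is missing, so a record without a URL never prints its batch or description. A possible alternative, sketched here with `dict.get`, prints whatever fields are present. The helper name `describe` and the sample records are hypothetical; the records only imitate the shape of the API response.

```python
def describe(item):
    """Return printable 'label: value' lines for one company record."""
    fields = [
        ("name", "name"),
        ("URL", "url"),
        ("batch", "batch"),
        ("Description", "description"),
    ]
    lines = []
    for label, key in fields:
        value = item.get(key)
        if value:  # skip keys that are absent, None, or empty
            lines.append(f"{label}: {value}")
    return lines


# Hypothetical records shaped like the API response (values made up for the demo).
sample = [
    {"name": "Reddit", "url": "https://example.com/reddit",
     "batch": "s2005", "description": "The frontpage of the internet."},
    {"name": "Loopt", "batch": "s2005", "description": None},
]

for item in sample:
    print("\n".join(describe(item)))
```

With the real data, the same helper would be called on each element of `res` in place of `sample`.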
API snapshot
Response: