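Is there a way to scrape dynamically loaded data using Python?

Time: 2019-12-20 12:29:52

Tags: python web-scraping beautifulsoup scrapy

I am scraping data from a website. In the page source, the table data only shows up as "loading". I would like to know how to collect this data using Python. The site appears to be a React.js web app.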

  

URL:https://www.ycombinator.com/companies/

2 answers:

Answer 0 (score: 1)

I couldn't find it as a request under XHR, so you can use Selenium, which lets the page render, and then use pandas to scrape the table:

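from selenium import webdriver
import pandas as pd

# Path to the ChromeDriver executable; adjust for your machine
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')

url = 'https://www.ycombinator.com/companies/'
driver.get(url)  # load the page in a real browser so the JavaScript runs

# read_html returns a list of DataFrames, one per <table>; take the first
df = pd.read_html(driver.page_source)[0]

driver.quit()  # end the session and close the browser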

Output:

print (df)
[                    0      1                                                  2
0          Actiondesk  s2019  Google Sheets meets Zapier. Actiondesk lets no...
1               Alana  s2019  Helping large companies in LATAM hire blue-col...
2        Apero Health  s2019                            Modern medical billing.
3             Apurata  s2019  Small loans for the Latin American middle clas...
4        Arpeggio Bio  s2019  Arpeggio builds technology to watch and learn ...
5              Asayer  s2019  Asayer is a session replay tool for developers...
6           Asher Bio  s2019                    We build better immunotherapies
7          AudioFocus  s2019                                                NaN
8          Axite Labs  s2019  A modern IP licensing platform to accelerate t...
9               basis  s2019  Software to automate construction workflows, s...
10         Beacons AI  s2019  Helping creators monetize through short video ...
11              Binks  s2019  Binks is a chain of trusted micro-boutiques th...
12              Blair  s2019  Financing college education through Income Sha...
13       Boost Biomes  s2019                                                NaN
14            Bouncer  s2019  SDK for scanning and verifying credit cards an...
15         Brave Care  s2019  Modern healthcare for kids. We do that with a ...
16          Breadfast  s2019  Breadfast delivers fresh bread, milk and eggs ...
17        BuildStream  s2019              A market network for industrial labor
18     Business Score  s2019     Connecting startups with the things they need.
19              Canix  s2019  Canix makes it easy to get and stay compliant ...
20              Carry  s2019  Carry plans, books, and supports corporate tra...
21              Carve  s2019                                                NaN
22            Cloosiv  s2019  Cloosiv is an order-ahead app for independent ...
23               Coco  s2019  The Venezuelan Instacart - allowing Venezuelan...
24     CoLab Software  s2019              Jira for Mechanical Engineering Teams
25           Compound  s2019  Compound helps people who work at startups und...
26            Courier  s2019  Send your product's user notifications to the ...
27             Covela  s2019     The digital insurance broker for SMEs in LATAM
28              Cuboh  s2019  Cuboh helps restaurants use several delivery p...
29              Curri  s2019  We provide on-demand material delivery for the...
              ...    ...                                                ...
2009           Zenter  w2007                                                NaN
2010          Jamglue  s2006                                                NaN
2011         Jumpchat  s2006                                                NaN
2012       Likebetter  s2006                                                NaN
2013           Omgpop  s2006                                                NaN
2014       Pollground  s2006                                      Online polls.
2015           Scribd  s2006                    World's largest online library.
2016         Shoutfit  s2006                                                NaN
2017          Talkito  s2006                                                NaN
2018       Thinkature  s2006                                                NaN
2019            Xobni  s2006                                                NaN
2020        Zanbazaar  s2006                                                NaN
2021        Audiobeta  w2006                                                NaN
2022         Clustrix  w2006                                                NaN
2023            Flagr  w2006                                                NaN
2024          Inkling  w2006                                                NaN
2025  Project Wedding  w2006                                                NaN
2026         Snipshot  w2006                  We sold Snipshot to Ansa in 2013.
2027            Wufoo  w2006                               Online form builder.
2028          Airtime  s2005                                                NaN
2029       Clickfacts  s2005                                                NaN
2030         Infogami  s2005                                                NaN
2031             Kiko  s2005  We're the best online calendar solution to eve...
2032            Loopt  s2005                                                NaN
2033           Memamp  s2005                                                NaN
2034          Parakey  s2005                                                NaN
2035        Posthaven  s2005                                   Blogging forever
2036           Reddit  s2005                     The frontpage of the internet.
2037          Simmery  s2005                                                NaN
2038        TextPayMe  s2005                                                NaN

[2039 rows x 3 columns]]
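
To make the table-extraction step a bit more concrete, here is a rough, standard-library-only sketch of pulling rows out of rendered HTML. The `TableRows` class and the sample HTML are made up for illustration; this is not how pandas implements `read_html`, and it only handles simple, non-nested tables. Note that it yields an empty string for an empty cell, where pandas shows NaN.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for driver.page_source after the page has rendered.
rendered_html = """
<table>
  <tr><td>Example Co</td><td>s2019</td><td>A demo entry.</td></tr>
  <tr><td>No Description Inc</td><td>w2006</td><td></td></tr>
</table>
"""

parser = TableRows()
parser.feed(rendered_html)
rows = parser.rows
print(rows)
```

The point is that the parser only sees whatever HTML it is given, which is why the page has to be rendered in a browser first: before rendering, the source contains no table rows to extract.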

Answer 1 (score: 1)

If you open the Network tab, you will find the API below, which returns the data in JSON format. You don't need Selenium or BeautifulSoup.

  

https://api.ycombinator.com/companies/export.json

Here is the code:

import requests

# Fetch the company list; the endpoint returns a JSON array of objects
res = requests.get("https://api.ycombinator.com/companies/export.json").json()

# (label to print, key in the JSON object)
fields = (('name', 'name'), ('URL', 'url'), ('batch', 'batch'), ('Description', 'description'))

for item in res:
    for label, key in fields:
        value = item.get(key)
        # Skip only the missing or null field, not the rest of the record
        if value is not None:
            print(f'{label}:{value}')

[Image: API snapshot]

[Image: response]
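
If you want to save the records rather than print them, a minimal sketch along these lines might help. It uses `urllib` from the standard library instead of `requests`, and it assumes the field names used in the code above (`name`, `url`, `batch`, `description`). The helper names `fetch_companies` and `companies_to_csv` and the sample records are made up for illustration, and I can't promise the endpoint is still live.

```python
import csv
import io
import json
from urllib.request import urlopen

# Endpoint taken from the answer above; it may no longer be available.
API_URL = "https://api.ycombinator.com/companies/export.json"

# Field names used in the answer's code; other keys in the payload are ignored.
FIELDS = ("name", "url", "batch", "description")


def fetch_companies(url=API_URL, timeout=30):
    """Download and decode the JSON list (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def companies_to_csv(items):
    """Return CSV text, one row per company; missing or null fields become ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)
    for item in items:
        row = []
        for field in FIELDS:
            value = item.get(field)
            row.append("" if value is None else value)
        writer.writerow(row)
    return buffer.getvalue()


# Made-up sample records standing in for the real response.
sample = [
    {"name": "Example Co", "url": "http://example.com", "batch": "s2019",
     "description": "A demo entry."},
    {"name": "No Description Inc", "batch": "w2006", "description": None},
]
csv_text = companies_to_csv(sample)
print(csv_text)
```

For the real data you would call `companies_to_csv(fetch_companies())` and write the result to a file.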