Question

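We are trying to determine the best strategies/tools for crawling the web to find sites that use a particular JS API or service.

For example, we would like to determine how many sites use Google Analytics.

Sure, we could check for the presence of a UA-XXX-XX variable, but that won't work if we want to find sites using Disqus, and so on. We would rather run a headless browser and look for pages that make network connections to www.google-analytics.com.

What is the best strategy?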

Answer 1

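There are three ways to go about this:

1. Implement a web crawler, or snoop on browsers or the network
2. Trick an existing crawler into giving up this information
3. Get it from someone who has already collected it

You asked about #1, but let me address the others first.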

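Google search supports queries like link:google-analytics.com to find pages that link to google-analytics.com, but it turns up zero results. I'm guessing link: only greps for <a href="*"> anchors rather than components loaded by the page, so it looks like Google is no help here.

Perhaps searching code on, e.g., GitHub would give you some information. That doesn't give you a comprehensive view of the web, but it does provide a window.

Ghostery makes a privacy add-on for browsers and makes money from the users who opt in to sending data to its servers. It then sells that data to companies so they can see how they are violating their own privacy policies (this is an oversimplification). This means Ghostery has this information and sells it as part of its Marketing Cloud Management service.

Don't reinvent the wheel when you don't have to! Ghostery has this data. So do Alexa, BuiltWith and Pingdom (as noted in the Popularity section of Wikipedia's Google Analytics article).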

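As for doing this yourself, it depends on your resources. If you run a corporate network, you might get away with sniffing packets using something like Snort or, if you have more control, a caching transparent web proxy like Squid. If you have a user base, you could write a browser plugin similar to Ghostery (at least on the data collection side).

Otherwise you'll have to implement your own crawler (there are several to choose from) and set it loose on the web, though you'll probably need to write signatures for what you're looking for. You probably don't want to execute arbitrary JavaScript.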

I imagine you're a long way off from any of those. Crawling one or two levels deep into each of the Alexa Top 500 sites should be a pretty good start.
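
To make "one or two levels deep" concrete, here is a minimal sketch of a depth-limited, same-site crawl. It is only an illustration: the `crawl` function, the `LinkCollector` class, and the caller-supplied `fetch(url)` function are hypothetical names, and a real crawler would also need politeness delays, robots.txt handling and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl that stays on the start site.

    fetch is a caller-supplied function (hypothetical): url -> HTML text.
    Pages more than max_depth links away from start_url are not fetched.
    Returns a dict mapping each fetched URL to its HTML.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth >= max_depth:
            continue  # deep enough: keep the page, do not follow its links
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing `fetch` in keeps the traversal logic independent of how pages are downloaded, so the same loop works with a plain HTTP client or a headless browser.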

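Once you start collecting data (by whatever method you choose), you'll need to process it so you can delete it before you run out of disk space. I'd think a predefined list of regular expressions (basically {{1}} calls) matching the items you're looking for would probably suffice. Tally and delete.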
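
As a rough sketch of the "tally and delete" idea, the snippet below matches page source against a small table of regular expressions and keeps only per-site results, not the HTML. The patterns in `SIGNATURES` and the `detect` and `tally` helpers are illustrative assumptions, not vetted signatures; real-world snippets vary and the patterns would need tuning.

```python
import re
from collections import Counter

# Hypothetical signatures; extend or tighten them for real use.
SIGNATURES = {
    "google-analytics": re.compile(
        r"google-analytics\.com/(?:ga|analytics|urchin)\.js|\bUA-\d+-\d+\b"
    ),
    "disqus": re.compile(r"disqus\.com/embed\.js|disqus_shortname"),
}


def detect(html):
    """Return the set of service names whose signature appears in html."""
    return {name for name, pattern in SIGNATURES.items() if pattern.search(html)}


def tally(pages):
    """Count how many distinct sites use each service.

    pages is an iterable of (site, html) pairs. Only the set of detected
    services per site is retained, so the HTML can be discarded right away.
    """
    found = {}
    for site, html in pages:
        found.setdefault(site, set()).update(detect(html))
    counts = Counter()
    for services in found.values():
        counts.update(services)
    return counts
```

Counting per site rather than per page avoids inflating the numbers for sites that were crawled several levels deep.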

Answer 2

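For this I would use Python Scrapy and add a middleware so that Scrapy sends all page requests to a Selenium grid. Selenium then requests and renders the page, and you can access and work with the data from within Scrapy.

Use: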

https://github.com/brady-vitrano/dsgrid

http://scrapy.org/

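Here is a Scrapy Selenium middleware that sends all requests using dsgrid and lets you process the rendered pages in Scrapy with XPath as usual. You could also create a middleware that only acts when it sees a specific domain, and other cool stuff. This way you can also crawl pages with PhantomJS and interact with them from Scrapy, or use any other driver.

I wrote this a long time ago and haven't used it for a while, but I remember it working quite well for me.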

from scrapy.http import HtmlResponse
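from scrapy.utils.project import get_project_settings

# scrapy.conf is gone from newer Scrapy releases; load the project settings explicitly
settings = get_project_settings()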

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.phantomjs.service import Service as PhantomJSService

# from pyvirtualdisplay import Display
# display = Display(visible=0, size=(800, 600))
# display.start()

selenium_grid_url = "http://172.17.0.2:4444/wd/hub"

reuseable_driver = None

phantomjs_path = '/usr/bin/phantomjs'


class WebDriverProxy():
    def chrome_driver(self):
        pass

    def firefox_driver(self):
        pass

    def phantomjs_driver(self):
        # monkey patch Service temporarily to include desired args
        class NewService(PhantomJSService):
            def __init__(self, *args, **kwargs):
                # copy any caller-supplied args and append the SOCKS5 proxy flags
                service_args = list(kwargs.get('service_args') or [])
                service_args += [
                    '--proxy=127.0.0.1:9050',
                    '--proxy-type=socks5',
                ]
                # write the list back so the parent Service actually receives it
                kwargs['service_args'] = service_args
                super(NewService, self).__init__(*args, **kwargs)
        webdriver.phantomjs.webdriver.Service = NewService
        # init the webdriver
        driver = webdriver.PhantomJS(phantomjs_path)
        # undo monkey patch
        webdriver.phantomjs.webdriver.Service = PhantomJSService
        return driver


def get_driver(driver_type=None, implicitly_wait=10):
    if reuseable_driver is not None:
        driver = reuseable_driver
    else:
        if driver_type:
            driver = None

            # capabilities = DesiredCapabilities.FIREFOX.copy()

            if not driver and 'phantomjs' in driver_type.lower():
                driver = WebDriverProxy().phantomjs_driver()
                # driver = webdriver.PhantomJS()
                # capabilities = DesiredCapabilities.PHANTOMJS.copy()
            if not driver and 'firefox' in driver_type.lower():
                driver = webdriver.Firefox()
                # capabilities = DesiredCapabilities.FIREFOX.copy()
            if not driver and 'chrome' in driver_type.lower():
                driver = webdriver.Chrome()
                # capabilities = DesiredCapabilities.CHROME.copy()
            if not driver:
                driver = webdriver.PhantomJS()

            # driver = webdriver.Remote(desired_capabilities=capabilities, command_executor=selenium_grid_url)
        else:
            driver = webdriver.PhantomJS()

        driver.implicitly_wait(implicitly_wait)

    return driver


def ajax_complete(driver):
    try:
        return 0 == driver.execute_script("return jQuery.active")
    except WebDriverException:
        pass


def driver_wait(driver, itime, callback):
    #wait for ajax items to load
    WebDriverWait(driver, itime).until(callback,)

    assert "ajax loaded string" in driver.page_source


class SeleniumDownloaderMiddleware(object):
    def process_request(self, request, spider):
        if hasattr(spider, 'use_selenium') and spider.use_selenium:
            #check if spider has driver defined if not just use from settings
            if hasattr(spider, 'selenium_driver'):
                driver_type = getattr(spider, 'selenium_driver', settings.get('SELENIUM_WEBDRIVER'))
            else:
                driver_type = settings.get('SELENIUM_WEBDRIVER')

            driver = get_driver(driver_type=driver_type)

            driver.get(request.url)

            #set that this request was made with selenium driver
            request.meta['is_selenium'] = True

            html = driver.execute_script('return document.documentElement.innerHTML;')
            if '</pre></body>' in html:
                if 'head><body><pre' in html:
                    if '<head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">' in html[0:78]:
                        html = html.replace('<head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">', '')
                    elif '</head><body><pre>' in html:
                        html = html.split('</head><body><pre>')[1]
                if '</pre></body>' in html[-13:]:
                    html = html.replace('</pre></body>', '')

            url = driver.execute_script('return window.location.href;')

            response = HtmlResponse(
                url=url,
                encoding='utf-8',
                body=html.encode('utf-8'),
                request=request,
            )
            # close the browser if the spider asks for it
            if getattr(spider, 'selenium_close_driver', False) is True:
                # without this declaration the assignment only creates a local name
                global reuseable_driver
                reuseable_driver = None
                driver.close()

            return response
        return None

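Just like the SeleniumDownloaderMiddleware above, you can create another middleware called GoogleAnalyticsCountMiddleware that checks whether a response references www.google-analytics.com and increments a counter, or parses out the UA-XXX-XX ID and increments a count in an array or database (SQLAlchemy).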
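
A minimal sketch of what such a GoogleAnalyticsCountMiddleware might look like, assuming it runs as a Scrapy downloader middleware and sees the rendered HTML returned by the Selenium middleware. The attribute names `pages_with_ga` and `property_ids` and the two module-level constants are made up for illustration, and the counts live in memory rather than in a database.

```python
import re
from collections import Counter

# Assumed signature for classic Google Analytics property IDs (UA-XXX-XX).
UA_ID = re.compile(r"\bUA-\d+-\d+\b")
GA_HOST = "google-analytics.com"


class GoogleAnalyticsCountMiddleware(object):
    """Tallies responses whose body references Google Analytics."""

    def __init__(self):
        self.pages_with_ga = 0          # number of matching responses
        self.property_ids = Counter()   # UA-XXX-XX id -> number of pages

    def process_response(self, request, response, spider):
        # Non-text responses have no usable .text; treat them as empty.
        body = getattr(response, "text", "")
        ids = set(UA_ID.findall(body))
        if ids or GA_HOST in body:
            self.pages_with_ga += 1
            self.property_ids.update(ids)
        # Always hand the response on unchanged to the rest of the chain.
        return response
```

To try it, the class would be listed in the project's DOWNLOADER_MIDDLEWARES setting alongside the Selenium middleware.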

You can then implement more middleware in Scrapy, such as a DatabaseMiddleware to save the data to your database or an APIMiddleware to talk to an API you have built to accept this data.
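
For the storage side, here is a small sketch using the standard library's sqlite3 in place of SQLAlchemy, purely to keep the example self-contained. The `DetectionStore` class, its table layout and its method names are hypothetical; a DatabaseMiddleware could call `record` whenever a service is detected on a site.

```python
import sqlite3


class DetectionStore(object):
    """Stores one row per (site, service) pair that has been detected."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS detections ("
            "site TEXT NOT NULL, service TEXT NOT NULL, "
            "PRIMARY KEY (site, service))"
        )

    def record(self, site, service):
        # The primary key plus INSERT OR IGNORE makes repeat sightings a no-op.
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO detections (site, service) VALUES (?, ?)",
                (site, service),
            )

    def count_sites(self, service):
        row = self.conn.execute(
            "SELECT COUNT(*) FROM detections WHERE service = ?", (service,)
        ).fetchone()
        return row[0]
```

Keying on the (site, service) pair means the final count answers the original question directly: how many sites use a given service.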

Crawling the web for sites that use a specific JS API
