Question

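I'm creating a distributed crawling Python app. It consists of a master server and associated client apps that run on client servers. The purpose of each client app is to run across a targeted site and extract specific data. The clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifically geared towards a given site.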

Each client app looks something like this:

main:
 parse initial url
 call function level1 (data1)

function level1 (data1)
 parse the url, for data1
 use the required xpath to get the dom elements
 call the next function
 call level2 (data2)

function level2 (data2)
 parse the url, for data2
 use the required xpath to get the dom elements
 call the next function
 call level3 (data3)

function level3 (data3)
 parse the url, for data3
 use the required xpath to get the dom elements
 call the next function
 call level4 (data4)

function level4 (data4)
 parse the url, for data4
 use the required xpath to get the dom elements

 at the final function..
 --all the data is output, and eventually returned to the server
 --at this point the data has elements from each function...

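My question: given that the number of calls the current function makes to its child function varies, I'm trying to figure out the best approach.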

 Each function essentially fetches a page of content, and then parses 
 the page using a number of different XPath expressions, combined 
 with different regex expressions depending on the site/page.

 If I run a client on a single box, as a sequential process, it'll 
 take a while, but the load on the box is rather small. I've thought 
 of implementing the child functions as threads spawned from the 
 current function, but that could be a nightmare, as well as quickly 
 bring the "box" to its knees!

 I've thought of breaking the app up in a manner that would allow 
 the master to essentially pass packets to the client boxes, in a 
 way to allow each client/function to be run directly from the 
 master. This requires a bit of a rewrite, but it has a number of 
 advantages: a good deal of redundancy, and speed. It could detect 
 if a section of the process was crashing and restart from that 
 point. But I'm not sure it would be any faster...

I'm writing the parsing scripts in Python.

So... any thoughts/comments would be appreciated.

I can go into a lot more detail, but didn't want to bore anyone!

Thanks!

Tom

Answer 1

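This sounds like a use case for MapReduce on Hadoop.

Hadoop Map/Reduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. In your case, this would be a smaller cluster.

A Map/Reduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.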

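You mentioned that,

"I've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way to allow each client/function to be run directly from the master."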

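From what I understand, you want a main machine (box) to act as the master, and client boxes that run the other functions. For instance, you could run your main() function on the master and parse the initial URLs there. The nice thing is that you could parallelize the task for each of these URLs across different machines, since they appear to be independent of each other.

Since level4 depends on level3, which depends on level2, and so on, you can pipe the output of each level to the next rather than calling one from within the other.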
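To make the "pipe each level into the next" idea concrete, here is a minimal single-machine sketch using Python generators. Everything in it is an assumption made for illustration: `FAKE_SITE` stands in for the real site (no network access), and `expand` and `run_pipeline` are hypothetical names, not part of Hadoop or any other library. The point is only that each level consumes a stream of records and emits a new stream, so the varying number of children per page falls out naturally.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]

# Hypothetical stand-in for the target site: page -> pages reachable from it.
FAKE_SITE: Dict[str, List[str]] = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}


def expand(records: Iterable[Record], level: str) -> Iterator[Record]:
    """One pipeline stage: emit one record per child page of each input record."""
    for record in records:
        for child in FAKE_SITE.get(record["page"], []):
            # Carry forward what earlier levels collected, add this level's data.
            yield {**record, "page": child, level: child}


def run_pipeline(start_page: str, levels: List[str]) -> List[Record]:
    """Chain the stages so that the output of level N is the input of level N+1."""
    stream: Iterable[Record] = [{"page": start_page}]
    for level in levels:
        stream = expand(stream, level)
    return list(stream)


if __name__ == "__main__":
    for row in run_pipeline("start", ["level1", "level2"]):
        print(row)
```

Because no stage calls the next one directly, any stage could later be moved to another process or machine by replacing the in-memory stream with a queue or with files.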

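For examples of how to do this, I'd recommend checking out the following tutorials, in the given order:

The Hadoop tutorial is a simple introduction to and overview of what map-reduce is and how it works.
Michael Noll's tutorial shows how to use Hadoop from Python (the basic concepts of Mapper and Reducer) in a simple way.
Finally, there is a tutorial for a framework called Dumbo, released by the folks at Last.fm, which automates and builds on Michael Noll's basic example for use in a production system.

Hope this helps.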
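For a rough feel of the mapper/reducer shape before reading those tutorials, here is a small local sketch. It is not taken from any of them. It assumes the usual streaming convention of tab-separated `key<TAB>value` lines, with the reducer's input sorted by key, and it replaces Hadoop's shuffle step with a plain `sorted()` call. The task (counting URLs per host) and the names `mapper`, `reducer` and `simulate` are made up for illustration.

```python
from itertools import groupby
from typing import Iterable, Iterator, List
from urllib.parse import urlsplit


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'host<TAB>1' for every non-empty URL line."""
    for line in lines:
        url = line.strip()
        if url:
            yield f"{urlsplit(url).netloc}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for host, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{host}\t{sum(int(count) for _, count in group)}"


def simulate(lines: Iterable[str]) -> List[str]:
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = [
        "http://a.example/x",
        "http://b.example/y",
        "http://a.example/z",
    ]
    for line in simulate(sample):
        print(line)
```

On a real cluster the two functions would live in separate scripts that read standard input and write standard output, and the framework would do the sorting between them.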

Answer 2

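Take a look at the multiprocessing module. It allows you to set up a work queue and a pool of workers: as you parse a page, you can spawn off tasks to be handled by separate processes.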
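A minimal sketch of what that could look like with `multiprocessing.Pool`. The page data and the `parse_page` and `parse_all` helpers are placeholders invented for this example. A real worker would fetch the page and apply the XPath/regex expressions, while this one only counts words in canned text so that it runs without a network.

```python
from multiprocessing import Pool
from typing import Dict, List

# Hypothetical canned pages so the sketch runs without network access.
FAKE_PAGES: Dict[str, str] = {
    "http://example.com/a": "alpha beta gamma",
    "http://example.com/b": "delta epsilon",
    "http://example.com/c": "",
}


def parse_page(url: str) -> Dict[str, object]:
    """Placeholder for the per-page work (fetch, then XPath/regex parsing)."""
    body = FAKE_PAGES.get(url, "")
    return {"url": url, "words": len(body.split())}


def parse_all(urls: List[str], workers: int = 4) -> List[Dict[str, object]]:
    """Fan the URLs out to a fixed-size pool; results keep the input order."""
    with Pool(processes=workers) as pool:
        return pool.map(parse_page, urls)


if __name__ == "__main__":
    for result in parse_all(list(FAKE_PAGES), workers=2):
        print(result)
```

The fixed pool size is what keeps the box from being brought to its knees: no matter how many child pages a level discovers, at most `workers` of them are processed at once.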

Answer 3

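Check out the scrapy package. It makes it easy to create "client apps" (a.k.a. crawlers, spiders, or scrapers) that go "deep" into a website.

brool and viksit both have good suggestions for the distributed part of your project.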

Python architecture question
