For both academic and performance reasons, given this crawling, recursive web-scraping function (which scrapes only within the given domain), what approach would make it run iteratively? Currently, by the time it finishes running, Python has climbed to using over 1GB of memory, which is unacceptable in a shared environment.
def crawl(self, url):
    "Get all URLS from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])  # recursive call: one stack frame (and parsed page) per URL
Answer 0 (score: 12)
Use BFS instead of recursive crawling (DFS): http://en.wikipedia.org/wiki/Breadth_first_search

You can use an external storage solution (such as a database) for the BFS queue to free up RAM.

The algorithm is:
// pseudocode:
var urlsToVisit = new Queue(); // could be a queue (BFS) or a stack (DFS),
                               // probably with a database backing or similar
var visitedUrls = new Set();   // set of visited URLs

// initialization:
urlsToVisit.Add(rootUrl);

while (urlsToVisit.Count > 0) {
    var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
    var page = FetchPage(nextUrl);
    ProcessPage(page);
    visitedUrls.Add(nextUrl);
    var links = ParseLinks(page);
    foreach (var link in links)
        if (!visitedUrls.Contains(link))
            urlsToVisit.Add(link);
}
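A direct Python rendering of that loop might look like the sketch below; fetch_page, process_page, and parse_links are hypothetical stand-ins for whatever fetching and parsing the crawler already does:

from collections import deque

def crawl_bfs(root_url):
    urls_to_visit = deque([root_url])  # FIFO gives BFS; pop from the right for DFS
    visited_urls = set()               # URLs already processed
    while urls_to_visit:
        next_url = urls_to_visit.popleft()
        if next_url in visited_urls:
            continue                   # may have been queued twice before being visited
        page = fetch_page(next_url)    # hypothetical fetch helper
        process_page(page)             # hypothetical page handler
        visited_urls.add(next_url)
        for link in parse_links(page): # hypothetical link extractor
            if link not in visited_urls:
                urls_to_visit.append(link)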
Answer 1 (score: 5)
You can push newly discovered URLs onto a queue instead of recursing into them, then run until the queue is empty, with no recursion at all. If you keep the queue in a file, it uses almost no memory.
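As a rough sketch of the file-backed idea (the FileQueue class here is made up for illustration, not part of the answer): appended URLs live on disk, and only a read offset stays in memory:

class FileQueue(object):
    "Minimal sketch of a FIFO queue whose pending items live in a file."

    def __init__(self, path):
        self.path = path
        self.read_pos = 0
        open(self.path, 'w').close()  # start with an empty backing file

    def put(self, url):
        with open(self.path, 'a') as f:
            f.write(url + '\n')       # one URL per line

    def get(self):
        with open(self.path, 'r') as f:
            f.seek(self.read_pos)     # resume where the last read stopped
            line = f.readline()
            self.read_pos = f.tell()
        return line.rstrip('\n') or None  # None once the queue is drained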
Answer 2 (score: 2)
@Mehrdad - thanks for your reply; the example you provided was concise and easy to understand.

The solution:
def crawl(self, url):
    urls = Queue(-1)  # Queue from the Python 2 Queue module; -1 means unbounded
    _crawled = []
    urls.put(url)
    while not urls.empty():
        url = urls.get()
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr):
                    if Crawler._is_category(attr):
                        continue
                    else:
                        Crawler._visit(attr[1])
                        if attr[1] not in _crawled:
                            _crawled.append(attr[1])  # record it so each URL is queued only once
                            urls.put(attr[1])
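One note on this solution: the membership test attr[1] not in _crawled scans a Python list, which is O(n) per lookup. For a large crawl, making _crawled a set (_crawled = set() together with _crawled.add(attr[1])) keeps each lookup O(1).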
Answer 3 (score: 0)
This is easily done by simply using links as the queue:
def get_links(url):
    "Extract all matching links from a url"
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return []
    return links
def crawl(self, url):
    "Get all URLS from which to scrape categories."
    links = get_links(url)
    while len(links) > 0:
        link = links.pop()
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    # prepend the new links to the queue
                    links = get_links(attr[1]) + links
Of course, this doesn't really solve the memory problem...
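To actually bound memory, the bookkeeping can be moved to disk, as answer 0 suggests with a database-backed queue. A minimal sketch using SQLite for the visited set (the file name, table, and mark_visited helper are illustrative, not from any answer above):

import sqlite3

conn = sqlite3.connect('crawler.db')
conn.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')

def mark_visited(url):
    "Insert a URL; returns False if it was already recorded."
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('INSERT INTO visited (url) VALUES (?)', (url,))
        return True
    except sqlite3.IntegrityError:  # PRIMARY KEY rejects duplicates
        return False

A crawler can then replace the in-memory not in self._crawled test with if mark_visited(url):, and the visited set no longer grows in RAM.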