Question

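I am building a web crawler. Some of the data I put into the datastore gets saved and some does not, and I have no idea what the problem is.

Here is my crawler class: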

class Crawler(object):

    def get_page(self, url):
        try:
            req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"}) #  yessss!!! with the header, I am able to download pages
            #response = urlfetch.fetch(url, method='GET')
            #return response.content
        #except urlfetch.InvalidURLError as iu:
         #   return iu.message
            response = urllib2.urlopen(req)
            return response.read()

        except urllib2.HTTPError as e:
            return e.reason


    def get_all_links(self, page):
         return re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',page)


    def union(self, lyst1, lyst2):
        try:
            for elmt in lyst2:
                if elmt not in lyst1:
                    lyst1.append(elmt)
            return lyst1
        except e:
            return e.reason

#function that  crawls the web for links starting from the seed
#returns a dictionary of index and graph
    def crawl_web(self, seed="http://tonaton.com/"):
        query = Listings.query() #create a listings object from storage
        if query.get():
            objListing = query.get()
        else:
            objListing = Listings()
            objListing.toCrawl = [seed]
            objListing.Crawled = []

        start_time = datetime.datetime.now()
        while datetime.datetime.now()-start_time < datetime.timedelta(0,5):#tocrawl (to crawl can take forever)
            try:
                #while True:
                page = objListing.toCrawl.pop()

                if page not in objListing.Crawled:
                    content = self.get_page(page)
                    add_page_to_index(page, content)
                    outlinks = self.get_all_links(content)
                    graph = Graph() #create a graph object with the url
                    graph.url = page
                    graph.links = outlinks #save all outlinks as the value part of the graph url
                    graph.put()

                    self.union(objListing.toCrawl, outlinks)
                    objListing.Crawled.append(page)
            except:
                return False

        objListing.put() #save to database
        return True #return true if it works

The classes that define the various ndb models are in this Python module:

import os
import urllib
from google.appengine.ext import ndb
import webapp2

class Listings(ndb.Model):
    toCrawl = ndb.StringProperty(repeated=True)
    Crawled = ndb.StringProperty(repeated=True)

#let's see how this works

class Index(ndb.Model):
    keyword = ndb.StringProperty() # keyword part of the index
    url = ndb.StringProperty(repeated=True) # value part of the index

#class Links(ndb.Model):
 #   links = ndb.JsonProperty(indexed=True)

class Graph(ndb.Model):
    url = ndb.StringProperty()
    links = ndb.StringProperty(repeated=True)

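It used to work fine when I used JsonProperty instead of StringProperty(repeated=True). But JsonProperty is limited to 1500 bytes, so I got an error once.

Now, when I run the crawl_web member function, it does actually crawl, but when I check the datastore, only the Index entities have been created. No Graph, no Listings. Please help. Thanks.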

Answer 1

Putting your code together, adding the missing imports, and logging the exception eventually reveals the first killer problem:

Exception Indexed value links must be at most 500 characters
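To make the "logging the exception" step concrete, here is a minimal sketch of swapping a bare `except: return False` for a handler that records the traceback. The helper name `run_logged` is hypothetical and not part of the question's code; it only illustrates the idea with the standard `logging` module.

```python
import logging


def run_logged(step):
    """Run one unit of work; log the full traceback instead of hiding it.

    `step` is any zero-argument callable (a hypothetical stand-in for
    one iteration of the crawl loop).
    """
    try:
        step()
        return True
    except Exception:
        # logging.exception records the message plus the current traceback,
        # so errors such as the 500-character limit become visible in the logs.
        logging.exception("crawl step failed")
        return False
```

With something like this in place of the bare `except`, the crawl loop still returns False on failure, but the reason ends up in the log instead of being silently discarded.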

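And indeed, once you log outlinks, it is easy to see that several of them are far longer than 500 characters, so they cannot be items in an indexed property such as a StringProperty. After changing each repeated StringProperty to a repeated TextProperty (which is not indexed and therefore has no 500-characters-per-item limit), the code runs for a while (creating a few instances of Graph) but eventually dies when it tries to fetch one of the extracted "links".

Indeed, it is pretty obvious that the so-called "link" is actually a chunk of Javascript, and as such cannot be fetched.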
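As a rough illustration of the check described above, the following sketch separates extracted links by length so the oversized ones can be logged or inspected. The function name `split_by_length` and the constant name are hypothetical; the 500 figure is simply the limit quoted in the exception message.

```python
# Limit quoted in the "Indexed value links must be at most 500 characters" error.
MAX_INDEXED_LENGTH = 500


def split_by_length(links, limit=MAX_INDEXED_LENGTH):
    """Return (short, too_long): links that fit the limit and links that do not."""
    short = [link for link in links if len(link) <= limit]
    too_long = [link for link in links if len(link) > limit]
    return short, too_long
```

Anything that lands in `too_long` would be rejected by an indexed property, and in this case is also a strong hint that the extracted string is not a real URL.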

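So, essentially, the core bug in your code is not related to App Engine at all; rather, the issue is that your regular expression:

'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

does not properly extract outgoing links from a web page that contains Javascript as well as HTML.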
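One way to see the over-matching is to run the pattern from `get_all_links` on a tiny page. The sample markup below is made up for illustration. Note that inside a character class `$-_` is a range, covering every character from `$` to `_`, which includes `'`, `;`, `<`, `=` and `>`.

```python
import re

# The pattern from get_all_links in the question, written as a raw string.
URL_PATTERN = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
               r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Hypothetical sample markup, not taken from the crawled site.
sample = "<a href='http://example.com/a'>x</a>"

matches = re.findall(URL_PATTERN, sample)
print(matches)
```

The single match here is `http://example.com/a'>x</a>`: the pattern runs straight through the closing quote and the following tags. On a page full of inline Javascript the same effect can produce very long strings that are not URLs at all.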

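There are many issues with your code, but so far they only slow it down or make it harder to understand rather than killing it. What kills it is using that regular expression pattern to try to extract links from the page.

Check out "retrieve links from web page using python and BeautifulSoup": most answers there suggest using BeautifulSoup to extract links from a page, which may perhaps be a problem on App Engine, but one shows how to do it with just Python and REs.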
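As a third option, different from both BeautifulSoup and the regex approach mentioned above, the standard library's HTML parser can collect `href` values from anchor tags only. This is a minimal sketch under that assumption; `LinkCollector` and `extract_links` are hypothetical names, and relative links are simply skipped.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs found in <a href="..."> attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value.startswith(('http://', 'https://')):
                self.links.append(value)


def extract_links(html):
    """Return the list of absolute links in anchor tags of `html`."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Because the parser treats the body of a `<script>` element as raw text rather than markup, URL-like strings inside Javascript are not reported as links.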

Unable to save data in the datastore, but no errors
