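I ran each function as a separate script against different web pages and they worked, but after I combined them as functions in a single file, the program raises an error: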

(after entering the seed page and depth)

Traceback (most recent call last):
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 207, in <module>
    mainFunc(depth,url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 194, in mainFunc
    lst=perPage(url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 186, in perPage
    filterContent(url,page)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 149, in filterContent
    cursor.execute(sql)
  File "C:\Python27\lib\site-packages\MySQLdb\cursors.py", line 202, in execute
    self.errorhandler(self, exc, value)
  File "C:\Python27\lib\site-packages\MySQLdb\connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near \ and # and specials." />\n


I can't seem to find the problem. Here is the code:

def metaContent(page,url):#EXTRACTS META TAG CONTENT
    lst=[]
    while page.find("<meta")!=-1:
            start_link=page.find("<meta")
            page=page[start_link:]
            start_link=page.find("content=")
            start_quote=page.find('"',start_link)
            end_quote=page.find('"',start_quote+1)
            metaTag=page[start_quote+1:end_quote]
            page=page[end_quote:]
            lst.append(metaTag)

    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO META(URL, \
               KEYWORD) \
               VALUES ('%s','%s')" % \
               (url,lst[i])
        cursor.execute(sql)
    db.commit()

def filterContent(page,url):#FILTERS THE CONTENT OF THE REMAINING PORTION
    phrase = ['to','a','an','the',"i'm",\
        'for','from','that','their',\
        'i','my','your','you','mine',\
        'we','okay','yes','no','as',\
        'if','but','why','can','now',\
        'are','is','also']

    #CALLS FUNC TO REMOVE HTML TAGS
    page = strip_tags(page)

    #CONVERT TO LOWERCASE
    page = page.lower()

    #REMOVES WHITESPACES
    page = page.split()
    page = " ".join(page)

    #REMOVES IDENTICAL WORDS AND COMMON WORDS
    page = set(page.split())
    page.difference_update(phrase)

    #CONVERTS FROM SET TO LIST
    lst = list(page)

    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO WORDS(URL, \
               KEYWORD) \
               VALUES ('%s','%s')" % \
               (url,lst[i])
        cursor.execute(sql)
    db.commit()


#<6>
def perPage(url):#CALLS ALL THE FUNCTIONS
    page=pageContent(url)

    #REMOVES CONTENT BETWEEN SCRIPT TAGS
    flg=0
    while page.find("<script",flg)!=-1:
            start=page.find("<script",flg)
            end=page.find("</script>",flg)
            end=end+9
            i,k=0,end-start
            page=list(page)
            while i<k:
                    page.pop(start)
                    i=i+1
            page=''.join(page)
            flg=start
    #REMOVES CONTENT BETWEEN STYLE TAGS
    flg=0
    while page.find("<script",flg)!=-1:
            start=page.find("<style",flg)
            end=page.find("</style>",flg)
            end=end+9
            i,k=0,end-start
            page=list(page)
            while i<k:
                    page.pop(start)
                    i=i+1
            page=''.join(page)
            flg=start

    metaContent(url,page)
    lst=linksExt(url,page)
    filterContent(url,page)
    return lst#CHECK WEATHER NEEDED OR NOT


#<7>
crawled=[]
def mainFunc(depth,url):#FOR THE DEPTH MANIPULATION
    if (depth):
        lst=perPage(url)
        crawled.append(url)
        i=0
        if (depth-1):
            while i<len(lst):
                if url[i] not in crawled:
                    mainFunc(depth-1,url[i])
                i+=1

#CALLING MAIN FUNCTION
mainFunc(depth,url)

Please point out any mistakes, especially in the depth-handling function (mainFunc()). Any suggestions for improving the crawler would also help.

Answer 1

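This is definitely an SQL error: the quotes in your values are not being escaped. When a keyword or URL contains a quote character, building the statement with % string formatting puts that quote straight into the SQL text and breaks the syntax.

Instead of this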

sql = "INSERT INTO META(URL, \
           KEYWORD) \
           VALUES ('%s','%s')" % \
           (url,lst[i])
cursor.execute(sql)

and this

sql = "INSERT INTO WORDS(URL, \
           KEYWORD) \
           VALUES ('%s','%s')" % \
           (url,lst[i])
cursor.execute(sql)

try this, passing the values as a second argument so that the driver escapes them

sql = "INSERT INTO WORDS(URL, \
           KEYWORD) \
           VALUES (%s, %s)"
cursor.execute(sql, (url, lst[i]))

and this

sql = "INSERT INTO META(URL, \
           KEYWORD) \
           VALUES (%s, %s)"
cursor.execute(sql, (url, lst[i]))
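
To see why the quoting matters, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than MySQLdb so that it runs without a server; that swap is an assumption made for illustration, and note that sqlite3 uses ? placeholders where MySQLdb uses %s. The table layout and sample values are made up. A keyword containing an apostrophe breaks the string-formatted statement, but is stored without trouble when passed as a parameter.

```python
import sqlite3

# In-memory database standing in for the MySQL connection (assumption).
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE WORDS (URL TEXT, KEYWORD TEXT)")

url = "http://example.com/"   # hypothetical sample values
keyword = "i'm"               # the apostrophe ends the SQL string literal early

# String formatting: the apostrophe becomes part of the SQL text itself.
bad_sql = "INSERT INTO WORDS(URL, KEYWORD) VALUES ('%s','%s')" % (url, keyword)
try:
    cursor.execute(bad_sql)
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False      # syntax error, comparable to MySQL's error 1064

# Parameterized query: the value travels separately from the SQL text.
cursor.execute("INSERT INTO WORDS(URL, KEYWORD) VALUES (?, ?)", (url, keyword))
db.commit()

stored = cursor.execute("SELECT KEYWORD FROM WORDS").fetchall()
```

With MySQLdb the same idea applies; only the placeholder style differs.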

You are also using a while loop without ever incrementing i, so it never terminates. You could use a for loop instead:

for keyword in lst:
    sql = "INSERT INTO META(URL, \
           KEYWORD) \
           VALUES (%s, %s)"
    cursor.execute(sql, (url, keyword))
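
If the list is long, the loop could also be collapsed into a single executemany call, which is part of the Python DB-API and available on both MySQLdb and sqlite3 cursors. A sketch, again assuming sqlite3 (with ? placeholders) as a stand-in and using made-up sample data:

```python
import sqlite3

# In-memory database standing in for the MySQL connection (assumption).
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE META (URL TEXT, KEYWORD TEXT)")

url = "http://example.com/"                # hypothetical sample values
lst = ["python", "web crawler", "it's"]

# One (url, keyword) tuple per row; the driver escapes every value.
rows = [(url, keyword) for keyword in lst]
cursor.executemany("INSERT INTO META(URL, KEYWORD) VALUES (?, ?)", rows)
db.commit()

count = cursor.execute("SELECT COUNT(*) FROM META").fetchone()[0]
```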

Answer 2

In the recursive call inside mainFunc, you are calling a function named main,

main(depth-1,url[i])

There is no main function in your code.

Change it to,

mainFunc(depth-1,url[i])
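
For the depth handling more generally, here is a minimal sketch of a depth-limited recursive crawl. It assumes a hypothetical get_links(url) in place of perPage() and a small in-memory link graph in place of real HTTP requests; both are made up for illustration. Unlike the code in the question, it recurses over the links returned for the page rather than indexing into the url string.

```python
# Hypothetical link graph standing in for real pages (assumption).
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

def get_links(url):
    """Stand-in for perPage(): return the links found on a page."""
    return LINKS.get(url, [])

def crawl(depth, url, crawled=None):
    """Visit url and follow its links up to `depth` levels; return visit order."""
    if crawled is None:
        crawled = []
    # Stop when the depth budget is used up or the page was already visited.
    if depth <= 0 or url in crawled:
        return crawled
    crawled.append(url)
    for link in get_links(url):
        crawl(depth - 1, link, crawled)
    return crawled

visited = crawl(2, "a")
```

Passing the crawled list as an argument, instead of keeping it in a module-level variable, lets each crawl start with a clean history.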
