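I built a web crawler in Python 2.7, and I am using MySQLdb to insert the data into a database.
After entering the seed page and the depth, it fails with this error: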
Traceback (most recent call last):
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 207, in <module>
    mainFunc(depth,url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 194, in mainFunc
    lst=perPage(url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 186, in perPage
    filterContent(url,page)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 149, in filterContent
    cursor.execute(sql)
  File "C:\Python27\lib\site-packages\MySQLdb\cursors.py", line 202, in execute
    self.errorhandler(self, exc, value)
  File "C:\Python27\lib\site-packages\MySQLdb\connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ... and specials." />\n
I can't seem to find the problem. Here is the code:
def metaContent(page,url):#EXTRACTS META TAG CONTENT
    lst=[]
    while page.find("<meta")!=-1:
        start_link=page.find("<meta")
        page=page[start_link:]
        start_link=page.find("content=")
        start_quote=page.find('"',start_link)
        end_quote=page.find('"',start_quote+1)
        metaTag=page[start_quote+1:end_quote]
        page=page[end_quote:]
        lst.append(metaTag)
    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO META(URL, \
        KEYWORD) \
        VALUES ('%s','%s')" % \
        (url,lst[i])
        cursor.execute(sql)
    db.commit()
def filterContent(page,url):#FILTERS THE CONTENT OF THE REMAINING PORTION
    phrase = ['to','a','an','the',"i'm",\
    'for','from','that','their',\
    'i','my','your','you','mine',\
    'we','okay','yes','no','as',\
    'if','but','why','can','now',\
    'are','is','also']
    #CALLS FUNC TO REMOVE HTML TAGS
    page = strip_tags(page)
    #CONVERT TO LOWERCASE
    page = page.lower()
    #REMOVES WHITESPACES
    page = page.split()
    page = " ".join(page)
    #REMOVES IDENTICAL WORDS AND COMMON WORDS
    page = set(page.split())
    page.difference_update(phrase)
    #CONVERTS FROM SET TO LIST
    lst = list(page)
    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO WORDS(URL, \
        KEYWORD) \
        VALUES ('%s','%s')" % \
        (url,lst[i])
        cursor.execute(sql)
    db.commit()
#<6>
def perPage(url):#CALLS ALL THE FUNCTIONS
    page=pageContent(url)
    #REMOVES CONTENT BETWEEN SCRIPT TAGS
    flg=0
    while page.find("<script",flg)!=-1:
        start=page.find("<script",flg)
        end=page.find("</script>",flg)
        end=end+9
        i,k=0,end-start
        page=list(page)
        while i<k:
            page.pop(start)
            i=i+1
        page=''.join(page)
        flg=start
    #REMOVES CONTENT BETWEEN STYLE TAGS
    flg=0
    while page.find("<script",flg)!=-1:
        start=page.find("<style",flg)
        end=page.find("</style>",flg)
        end=end+9
        i,k=0,end-start
        page=list(page)
        while i<k:
            page.pop(start)
            i=i+1
        page=''.join(page)
        flg=start
    metaContent(url,page)
    lst=linksExt(url,page)
    filterContent(url,page)
    return lst#CHECK WHETHER NEEDED OR NOT
#<7>
crawled=[]
def mainFunc(depth,url):#FOR THE DEPTH MANIPULATION
    if (depth):
        lst=perPage(url)
        crawled.append(url)
        i=0
        if (depth-1):
            while i<len(lst):
                if url[i] not in crawled:
                    mainFunc(depth-1,url[i])
                i+=1
#CALLING MAIN FUNCTION
mainFunc(depth,url)
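Please point out any errors, especially in the depth-handling function (mainFunc()). Any suggestions for improving the crawler would also help.
Answer 0 (score: 1)
This is definitely an SQL error: the quotes inside your values are not being escaped.
Instead of this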
sql = "INSERT INTO META(URL, \
KEYWORD) \
VALUES ('%s','%s')" % \
(url,lst[i])
cursor.execute(sql)
and this
sql = "INSERT INTO WORDS(URL, \
KEYWORD) \
VALUES ('%s','%s')" % \
(url,lst[i])
cursor.execute(sql)
try this
sql = "INSERT INTO WORDS(URL, \
KEYWORD) \
VALUES (%s, %s)"
cursor.execute(sql, (url, lst[i]))
and this
sql = "INSERT INTO META(URL, \
KEYWORD) \
VALUES (%s, %s)"
cursor.execute(sql, (url, lst[i]))
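To see why the formatted string fails, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for MySQLdb so that it runs without a MySQL server; note that sqlite3 writes the placeholder as ? where MySQLdb uses %s. The table layout, the URL, and the sample keyword are assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the MySQL server (assumption for illustration).
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE WORDS (URL TEXT, KEYWORD TEXT)")

url = "http://example.com/"
keyword = "today's"  # hypothetical scraped word containing a single quote

# String formatting pastes the quote straight into the statement and breaks it.
bad_sql = "INSERT INTO WORDS(URL, KEYWORD) VALUES ('%s','%s')" % (url, keyword)
try:
    cursor.execute(bad_sql)
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False

# Passing the values separately lets the driver take care of the quoting.
cursor.execute("INSERT INTO WORDS(URL, KEYWORD) VALUES (?, ?)", (url, keyword))
db.commit()

rows = cursor.execute("SELECT URL, KEYWORD FROM WORDS").fetchall()
print(formatted_ok, rows)
```

Any scraped word that contains a quote will trigger the same kind of syntax error, which is why the error only shows up on some pages.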
You are also using a while loop without ever incrementing i, so it never ends. You could use this instead
for keyword in lst:
    sql = "INSERT INTO META(URL, \
    KEYWORD) \
    VALUES (%s, %s)"
    cursor.execute(sql, (url, keyword))
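If the list is long, the loop can probably be replaced by a single executemany call, which is part of the Python DB-API that both MySQLdb and sqlite3 follow. The sketch below again uses sqlite3 (placeholder ? instead of %s) so that it runs standalone; the table and the keyword list are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the MySQL server (assumption for illustration).
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE META (URL TEXT, KEYWORD TEXT)")

url = "http://example.com/"
lst = ["crawler", "it's", "python"]  # hypothetical meta keywords

# One (url, keyword) tuple per row; executemany runs the statement once per tuple.
params = [(url, keyword) for keyword in lst]
cursor.executemany("INSERT INTO META(URL, KEYWORD) VALUES (?, ?)", params)
db.commit()  # a single commit after the whole batch

count = cursor.execute("SELECT COUNT(*) FROM META").fetchone()[0]
print(count)
```

Committing once after the batch, rather than after every row, should also reduce the time spent on each page.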
Answer 1 (score: 0)
In the recursive call inside mainFunc, you are calling a function named main:
main(depth-1,url[i])
There is no main function in your code. Change it to:
mainFunc(depth-1,url[i])
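Beyond the function name, note that the posted loop indexes url[i], which is a single character of the URL string, rather than lst[i], the link returned by perPage. A minimal sketch of a depth-limited recursion over the returned links might look like this; the LINKS dictionary and the stubbed perPage are hypothetical stand-ins for real page downloads, not the asker's implementation.

```python
# Hypothetical link graph standing in for real pages (assumption for illustration).
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

crawled = []

def perPage(url):
    # Stub: the real perPage would download the page and extract its links.
    return LINKS.get(url, [])

def mainFunc(depth, url):
    # Stop when the depth budget is used up or the page was already visited.
    if depth <= 0 or url in crawled:
        return
    crawled.append(url)
    for link in perPage(url):
        # Recurse on the extracted link itself, not on a character of url.
        mainFunc(depth - 1, link)

mainFunc(3, "a")
print(crawled)
```

Checking crawled at the top of the function keeps a page that links back to its parent from being fetched twice.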