import sys

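sys.setrecursionlimit(1500) #This raises the recursion limit (the default is usually 1000),
#so deeper call chains are allowed before Python raises the recursion depth error.
#Setting it too high can still crash the interpreter with a real stack overflow.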

See this post for details: What is the maximum recursion depth in Python, and how to increase it?
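If you would rather not raise the limit for the whole program, here is a minimal sketch that raises it only around one call and then restores it. The helper `call_with_limit` and the toy function `depth` are hypothetical names made up for this illustration, and the snippet is written so that it also runs on Python 3.

```python
import sys

def call_with_limit(func, limit, *args):
    """Run func(*args) with the recursion limit temporarily set to `limit`."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        return func(*args)
    finally:
        # Always restore the previous limit, even if func raises.
        sys.setrecursionlimit(old_limit)

def depth(n):
    """Toy recursive function (illustration only): recurses n levels, returns n."""
    return 0 if n == 0 else 1 + depth(n - 1)

# 2000 nested calls exceed the usual default limit of 1000 but fit under 3000.
print(call_with_limit(depth, 3000, 2000))
```

This only buys more headroom. If the recursion is effectively unbounded, the error will come back at the higher limit.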


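I am scraping web pages for dates. So far I have successfully used re.findall to extract dates in the format I am searching for, but once I get to the 33rd link I get a "maximum recursion depth exceeded while calling a Python object" error, and it keeps pointing at the dates = re.findall(regex, str(webpage)) line.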


import urllib2
from bs4 import BeautifulSoup as BS
import re

#All code is correct between imports and the start of the For loop

for url in URLs:

    #Open and read the URL and specify html.parser as the parsing agent so that the parsing method remains uniform across systems
    webpage = BS(urllib2.urlopen(req).read(), "html.parser")

    #Create a list to store the dates to be searched
    regex = []

    #Append to a list those dates that have the end year "2011"
    regex.append("((?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))")

    #Join all the dates matched on the webpage from the regex by a comma
    regex = ','.join(regex)

    #Find the matching date format from the opened webpage 
    #[Recursion depth error happens here]
    dates = re.findall(regex, str(webpage))

    #If there aren't any dates that match, then go to the next link
    if not dates:
        print "There was no matching date found in row " + CurrentRow
        j += 1
        continue

    #Print the dates that match the RegEx and the row that they are on
    print "A date was found in the link at row " + CurrentRow
    print dates
    j += 1

Following up on my comment: what you can do is create many separate patterns and iterate over each one, instead of using a single pattern with many different alternatives. Something like this might work:

regex = "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
regex = ["((?:"+month+")[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))" for month in regex.split("|")]

matches = []
for pattern in regex:
    matches.append(re.findall(pattern, str(webpage)))


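This is a more iterative approach, but it is very slow, because it runs re.findall once for each month pattern on EVERY SINGLE WEBPAGE. With at least 33 links, as in your question, that means one re.findall call per pattern for each of those links. Also, I am not a Python expert by any means, and I am not even entirely sure this solution will completely get rid of your problem.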
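To make the per-month idea easier to try in isolation, here is a self-contained sketch that builds the patterns once, outside the URL loop, and reuses them for every page. It is an assumption-laden illustration rather than a tested fix for the recursion error. The names `MONTH_NAMES`, `DATE_TAIL`, `build_patterns` and `find_dates` are made up for this example, the sample text stands in for `str(webpage)`, and the code is written to run on Python 3. The day, separator and year part is copied from the pattern in the question.

```python
import re

MONTH_NAMES = ("January|February|March|April|May|June|July|August|September|"
               "October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|"
               "Sept|Oct|Nov|Dec").split("|")

# Day, separator and year part, copied from the pattern in the question.
DATE_TAIL = r"[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1])"

def build_patterns(month_names):
    """Compile one small pattern per distinct month name, keeping list order."""
    seen = set()
    patterns = []
    for month in month_names:
        if month in seen:  # "May" appears twice in the original list
            continue
        seen.add(month)
        patterns.append(re.compile("(?:" + re.escape(month) + ")" + DATE_TAIL))
    return patterns

# Built once, then reused for every page.
PATTERNS = build_patterns(MONTH_NAMES)

def find_dates(text):
    """Return all date strings found in text, grouped in pattern order."""
    matches = []
    for pattern in PATTERNS:
        matches.extend(pattern.findall(text))
    return matches

# Sample text standing in for str(webpage).
sample = "Posted on January 5, 2011 and updated on Sept. 12, '11."
print(find_dates(sample))
```

Skipping the duplicate "May" entry avoids reporting the same date twice. The number of findall calls is still one per pattern per page, so this tidies the loop without removing the cost described above.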