Maximum recursion depth exceeded while calling a Python object (recursion to iteration)

Date: 2015-10-26 16:22:08

Tags: python regex recursion iteration

Edit: Figured it out. I just did the following:

import sys

sys.setrecursionlimit(1500) #This increases the recursion limit, ultimately moving
#up the ceiling on the stack so it doesn't overflow.

See this post for details: What is the maximum recursion depth in Python, and how to increase it?
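As a rough illustration of the workaround above, the sketch below raises the limit only around one block of work and then restores the previous value. The helper name `recursion_limit`, the value 5000, and the decision to restore the old limit afterwards are assumptions made for this example, not part of the original fix. It is written for Python 3, whereas the question's code is Python 2.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below the value already in effect
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Put the original limit back once the block is done
        sys.setrecursionlimit(previous)


def depth(n):
    """Deliberately recursive function used to exercise the limit."""
    return 0 if n == 0 else 1 + depth(n - 1)


with recursion_limit(5000):
    print(depth(2000))
```

Raising the limit only moves the ceiling; the recursion itself is still there, and a limit set too high can crash the interpreter instead of raising an exception.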

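-------------- Original question -----------------

I'm scraping web pages for dates. So far I have successfully used re.findall to extract dates in the format I'm searching for, but once I get to the 33rd link I get a "maximum recursion depth exceeded while calling a Python object" error, and it keeps pointing at the dates = re.findall(regex, str(webpage)) line.

From what I've read, I need to use a loop in my code so that I can get rid of the recursion, but as a beginner I'm not sure how to change the piece of code that handles the regex and re.findall from recursive to iterative. Thanks in advance for any insight.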

import urllib2
from bs4 import BeautifulSoup as BS
import re

#All code is correct between imports and the start of the For loop

for url in URLs:
    ...

    #Open and read the URL and specify html.parser as the parsing agent so that the parsing method remains uniform across systems
    webpage = BS(urllib2.urlopen(req).read(), "html.parser")

    #Create a list to store the dates to be searched
    regex = []

    #Append to a list those dates that have the end year "2011"
    regex.append("((?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))")

    #Join all the dates matched on the webpage from the regex by a comma
    regex = ','.join(regex)

    #Find the matching date format from the opened webpage 
    #[Recursion depth error happens here]
    dates = re.findall(regex, str(webpage))

    #If there aren't any dates that match, then go to the next link
    if dates == []:
        print "There was no matching date found in row " + CurrentRow
        j += 1
        continue

    #Print the dates that match the RegEx and the row that they are on
    print "A date was found in the link at row " + CurrentRow
    print dates
    j += 1

2 Answers:

Answer 0: (score: 0)

I don't think

regex.append("...")

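is doing what you think it should be doing.

After the append call, regex is a single-element list holding the regular expression. The join that follows suggests to me that you think it should be a multi-element list.

Once that is fixed, I suspect your code will work better.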
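The point about append and join can be checked directly. This is a small illustration that uses shortened stand-in patterns rather than the long one from the question. The suggestion that several alternative patterns would be combined with `|` rather than `,` is an assumption about what the asker intended.

```python
import re

# Shortened stand-in for the long date pattern in the question
pattern = r"(?:Jan|Feb)[ ]\d{1,2}"

regex = []
regex.append(pattern)

# Joining a one-element list returns that element unchanged
joined = ",".join(regex)
print(joined == pattern)  # True

# With two patterns, "," is matched as a literal comma,
# so "|" is what expresses "either pattern"
two = [r"Jan[ ]\d{1,2}", r"Feb[ ]\d{1,2}"]
text = "Jan 5 and Feb 7"
print(re.findall(",".join(two), text))  # []
print(re.findall("|".join(two), text))  # ['Jan 5', 'Feb 7']
```

So with a single pattern the join is a no-op, and the list could simply be replaced by a plain string.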

Answer 1: (score: 0)

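Continuing from my comment, what you could do is create many different patterns and iterate over each one, rather than using a single pattern with many different | statements. Something like this might work:

regex = "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
regex = ["((?:"+month+")[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))" for month in regex.split("|")]
matches = []
for pattern in regex:
    matches.append(re.findall(pattern, str(webpage)))

This is a more iterative way of doing it, but it is very slow. That is because it runs re.findall once per month pattern on EVERY SINGLE WEBPAGE. As you can see, if you have at least 33 links, as in your question, that would be 24*33 runs of re.findall. Also, I'm not a Python expert by any means, and I'm not even completely sure that this solution will get rid of your problem entirely.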
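The per-month approach from this answer could be sketched as below, in Python 3, with the patterns compiled once so they are not rebuilt for every page. The names `MONTHS`, `TAIL`, `PATTERNS`, `find_dates` and the sample text are assumptions made for this example. It also differs from the answer in two ways: it drops the repeated "May", which leaves 24 distinct patterns, and it collects matches into one flat list instead of a list of lists. Whether this avoids the recursion error at all is not established here.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|September|"
          "October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|"
          "Sept|Oct|Nov|Dec")

# Remainder of the question's pattern: optional "." and ",", a space,
# a day from 1 to 31, one of "," "|" ".", a space, then 2011, '11 or `11
TAIL = r"[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1])"

# dict.fromkeys drops the duplicate "May" while keeping the original order
PATTERNS = [re.compile("((?:" + month + ")" + TAIL + ")")
            for month in dict.fromkeys(MONTHS.split("|"))]


def find_dates(text):
    """Run each per-month pattern over text and collect every match."""
    matches = []
    for pattern in PATTERNS:
        matches.extend(pattern.findall(text))
    return matches


sample = "Posted on October 26, 2011; updated Jan. 5, '11."
print(len(PATTERNS))       # 24
print(find_dates(sample))  # ['October 26, 2011', "Jan. 5, '11"]
```

Matches come back grouped by pattern order rather than by position in the text, so a caller that needs document order would have to sort them separately.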