import sys
sys.setrecursionlimit(1500)  # This raises the recursion limit, effectively lifting
# the ceiling on the stack so it doesn't overflow.
See this post for the details: What is the maximum recursion depth in Python, and how to increase it?
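As a side note (my own sketch, not from the post): in Python 3 you can inspect the current limit with `sys.getrecursionlimit` and measure how deep recursion actually gets before the interpreter raises `RecursionError` (Python 2, which the question's `urllib2` code targets, raises `RuntimeError` instead):

```python
import sys

def depth(n=0):
    # Recurse until the interpreter refuses, then report the depth reached.
    try:
        return depth(n + 1)
    except RecursionError:
        return n

print(sys.getrecursionlimit())   # the default is usually 1000
sys.setrecursionlimit(1500)      # the fix suggested above
print(sys.getrecursionlimit())   # → 1500
print(depth())                   # a bit under 1500: other frames use some stack
```

Raising the limit only postpones the failure; if the depth grows with the input (as with page after page of links), an iterative rewrite is the durable fix.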
-------------- Original question -----------------
I am scraping web pages for dates. So far I have successfully used re.findall to extract dates in the format I am searching for, but once I reach the 33rd link I get a "maximum recursion depth exceeded while calling a Python object" error, and it consistently points at the dates = re.findall(regex, str(webpage)) statement.
From what I have read, I need to use a loop in my code so that I can get away from the recursion, but as a newcomer I am not sure how to change the snippet that handles the RegEx and re.findall from recursive to iterative. Thanks in advance for any insight.
import urllib2
from bs4 import BeautifulSoup as BS
import re
#All code is correct between imports and the start of the For loop
for url in URLs:
    ...
    #Open and read the URL and specify html.parser as the parsing agent so that the parsing method remains uniform across systems
    webpage = BS(urllib2.urlopen(req).read(), "html.parser")
    #Create a list to store the dates to be searched
    regex = []
    #Append to a list those dates that have the end year "2011"
    regex.append("((?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))")
    #Join all the dates matched on the webpage from the regex by a comma
    regex = ','.join(regex)
    #Find the matching date format from the opened webpage
    #[Recursion depth error happens here]
    dates = re.findall(regex, str(webpage))
    #If there aren't any dates that match, then go to the next link
    if dates == []:
        print "There was no matching date found in row " + CurrentRow
        j += 1
        continue
    #Print the dates that match the RegEx and the row that they are on
    print "A date was found in the link at row " + CurrentRow
    print dates
    j += 1
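Before pointing the pattern at a full page, it can help to sanity-check it against a known string (a standalone sketch with a made-up sample, not from the original post):

```python
import re

# The same date pattern the question builds: month name, day, then ", 2011"
pattern = (r"((?:January|February|March|April|May|June|July|August|September|"
           r"October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|"
           r"Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ]"
           r"(?:(?:20|'|`)[1][1]))")

sample = "Posted on January 15, 2011 and updated March 3, 2012."
print(re.findall(pattern, sample))  # → ['January 15, 2011']
```

Only the 2011 date matches, since the year part of the pattern is hard-wired to "11".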
Answer 0 (score: 0)
I don't think
regex.append("...")
is doing what you think it should be doing.
After that call to append, regex is a one-element array holding your regular expression. The join that follows suggests to me that you believe it should be a multi-element array.
Once you fix that, I suspect your code will behave better.
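The single-element point is easy to verify in isolation (my own illustration, not part of the answer): ','.join on a one-element list returns that element unchanged, so the join in the question's code adds nothing.

```python
regex = []
regex.append("(foo|bar)")  # one append -> a one-element list, not many patterns
joined = ','.join(regex)   # joining a single element inserts no commas
print(joined)              # → (foo|bar)
print(joined == regex[0])  # → True
```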
Answer 1 (score: 0)
Continuing from my comment, what you could do is create many different patterns and iterate through each one, rather than using a single pattern with many different | alternatives. Something like this might work:

regex = "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
regex = ["((?:"+month+")[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))" for month in regex.split("|")]
matches = []
for pattern in regex:
    matches.append(re.findall(pattern, str(webpage)))

This is a more iterative way, but it is very slow, because it runs re.findall for every month name on EVERY SINGLE WEBPAGE. As you can see, with at least 33 links as in your question, that is 24*33 re.findall runs. Also, I'm not a Python expert by any means, and I'm not even entirely sure this solution would completely get rid of your problem.
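A runnable version of the idea above (with my own sample text, not the asker's pages; note I pre-compile each pattern and use extend rather than append, so matches ends up a flat list of strings instead of a list of lists):

```python
import re

months = ("January|February|March|April|May|June|July|August|September|"
          "October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|"
          "Sep|Sept|Oct|Nov|Dec")

# One compiled pattern per month token, instead of one giant alternation
patterns = [re.compile(r"((?:" + month + r")[\.]*[,]*[ ]"
                       r"(?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ]"
                       r"(?:(?:20|'|`)[1][1]))")
            for month in months.split("|")]

webpage = "Published April 5, 2011; revised Dec. 1, 2011."
matches = []
for pattern in patterns:
    matches.extend(pattern.findall(webpage))  # extend keeps the list flat

print(matches)  # → ['April 5, 2011', 'Dec. 1, 2011']
```

Compiling once per month token avoids re-parsing each pattern on every page, though the per-month loop is still slower than a single combined pattern.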