import sys
sys.setrecursionlimit(1500)  # This raises the recursion limit, effectively lifting
# the ceiling on the stack so it doesn't overflow.
See this post for the details: What is the maximum recursion depth in Python, and how to increase it?
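As a side note (my own sketch, not from the post): in Python 3 you can inspect the current limit with `sys.getrecursionlimit` and measure how deep recursion actually gets before the interpreter raises `RecursionError` (Python 2, which the question's `urllib2` code targets, raises `RuntimeError` instead):

```python
import sys

def depth(n=0):
    # Recurse until the interpreter refuses, then report the depth reached.
    try:
        return depth(n + 1)
    except RecursionError:
        return n

print(sys.getrecursionlimit())   # the default is usually 1000
sys.setrecursionlimit(1500)      # the fix suggested above
print(sys.getrecursionlimit())   # → 1500
print(depth())                   # a bit under 1500: other frames use some stack
```

Raising the limit only postpones the failure; if the depth grows with the input (as with page after page of links), an iterative rewrite is the durable fix.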
-------------- Original question -----------------
I am scraping web pages for dates. So far I have successfully used re.findall to extract dates in the format I am searching for, but once I reach the 33rd link I get a "maximum recursion depth exceeded while calling a Python object" error, and it consistently points at the dates = re.findall(regex, str(webpage)) statement.
From what I have read, I need to use a loop in my code so that I can get away from the recursion, but as a newcomer I am not sure how to change the snippet that handles the RegEx and re.findall from recursive to iterative. Thanks in advance for any insight.
import urllib2
from bs4 import BeautifulSoup as BS
import re
#All code is correct between imports and the start of the For loop
for url in URLs:
    ...
    #Open and read the URL and specify html.parser as the parsing agent so that the parsing method remains uniform across systems
    webpage = BS(urllib2.urlopen(req).read(), "html.parser")
    #Create a list to store the dates to be searched
    regex = []
    #Append to a list those dates that have the end year "2011"
    regex.append("((?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))")
    #Join all the dates matched on the webpage from the regex by a comma
    regex = ','.join(regex)
    #Find the matching date format from the opened webpage
    #[Recursion depth error happens here]
    dates = re.findall(regex, str(webpage))
    #If there aren't any dates that match, then go to the next link
    if dates == []:
        print "There was no matching date found in row " + CurrentRow
        j += 1
        continue
    #Print the dates that match the RegEx and the row that they are on
    print "A date was found in the link at row " + CurrentRow
    print dates
    j += 1
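Before pointing the pattern at a full page, it can help to sanity-check it against a known string (a standalone sketch with a made-up sample, not from the original post):

```python
import re

# The same date pattern the question builds: month name, day, then ", 2011"
pattern = (r"((?:January|February|March|April|May|June|July|August|September|"
           r"October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|"
           r"Oct|Nov|Dec)[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ]"
           r"(?:(?:20|'|`)[1][1]))")

sample = "Posted on January 15, 2011 and updated March 3, 2012."
print(re.findall(pattern, sample))  # → ['January 15, 2011']
```

Only the 2011 date matches, since the year part of the pattern is hard-wired to "11".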
Answer 0 (score: 0)
I don't think
regex.append("...")
is doing what you think it should be doing.
After that call to append, regex is a one-element array holding your regular expression. The join that follows suggests to me that you believe it should be a multi-element array.
Once you fix that, I suspect your code will behave better.
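The single-element point is easy to verify in isolation (my own illustration, not part of the answer): ','.join on a one-element list returns that element unchanged, so the join in the question's code adds nothing.

```python
regex = []
regex.append("(foo|bar)")  # one append -> a one-element list, not many patterns
joined = ','.join(regex)   # joining a single element inserts no commas
print(joined)              # → (foo|bar)
print(joined == regex[0])  # → True
```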
Answer 1 (score: 0)
Continuing from my comment, what you could do is create many different patterns and iterate through each one, rather than using a single pattern with many different | alternatives. Something like this might work:

regex = "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec"
regex = ["((?:"+month+")[\.]*[,]*[ ](?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ](?:(?:20|'|`)[1][1]))" for month in regex.split("|")]
matches = []
for pattern in regex:
    matches.append(re.findall(pattern, str(webpage)))

This is a more iterative way, but it is very slow, because it runs re.findall for every month name on EVERY SINGLE WEBPAGE. As you can see, with at least 33 links as in your question, that is 24*33 re.findall runs. Also, I'm not a Python expert by any means, and I'm not even entirely sure this solution would completely get rid of your problem.
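A runnable version of the idea above (with my own sample text, not the asker's pages; note I pre-compile each pattern and use extend rather than append, so matches ends up a flat list of strings instead of a list of lists):

```python
import re

months = ("January|February|March|April|May|June|July|August|September|"
          "October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|"
          "Sep|Sept|Oct|Nov|Dec")

# One compiled pattern per month token, instead of one giant alternation
patterns = [re.compile(r"((?:" + month + r")[\.]*[,]*[ ]"
                       r"(?:0?[1-9]|[12][0-9]|3[01])[,|\.][ ]"
                       r"(?:(?:20|'|`)[1][1]))")
            for month in months.split("|")]

webpage = "Published April 5, 2011; revised Dec. 1, 2011."
matches = []
for pattern in patterns:
    matches.extend(pattern.findall(webpage))  # extend keeps the list flat

print(matches)  # → ['April 5, 2011', 'Dec. 1, 2011']
```

Compiling once per month token avoids re-parsing each pattern on every page, though the per-month loop is still slower than a single combined pattern.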