Question

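I am writing a web crawler in Python using Mechanize and BeautifulSoup4. To store the data it collects for further analysis, I am using the shelve module. The block of code where the problem occurs is below.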

url_dict=shelve.open("url_dict.dat")
html=r.read()
soup=BeautifulSoup(html)
frames=soup.find_all("a",{"class":br_class})#br_class is defined globally
time.sleep(1)
for item in frames:
    url_suffix=item['href']
    full_url=url_prefix+url_suffix
    full_url=full_url.encode('ascii','ignore')
    if str(full_url) not in url_dict:
        url_dict[str(full_url)]=get_information(full_url,sr)
    time.sleep(1)

This code does get through one iteration of the loop before hitting the error. The function get_information() starts as follows:

def get_information(full_url,sr):   
    information_set=dict()
    r=sr.open(full_url)
    information_set['url']=full_url
    print("Set url")
    html=r.read()
    soup=BeautifulSoup(html)
    information_set["address"]=soup.find("h1",{"class":"prop-addr"}).text

sr is a browser object that reads the URL, and url_suffix is a unicode string. The object returned by get_information() is a dictionary, so url_dict is a dictionary of dictionaries.

On the second iteration of this loop, I get the following error:

Traceback (most recent call last):
  File "collect_re_data.py", line 219, in <module>
    main()
  File "collect_re_data.py", line 21, in main
    data=get_html_data()
  File "collect_re_data.py", line 50, in get_html_data
    url_dict[str(full_url)]=get_information(full_url,sr)
  File "C:\Python27\lib\shelve.py", line 132, in __setitem__
    p.dump(value)
  File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
    getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded while calling a Python object

Also, is there a better way to handle the data storage? My end goal is to get all of the data into a .csv file so that I can analyze it in R.

Answer 1

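This is a known issue with pickle and BeautifulSoup. shelve pickles each value before writing it (the traceback fails inside p.dump(value) in shelve.py), so I suspect the shelve problem is related.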
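A workaround that is often suggested, assuming the recursion comes from parser string objects ending up in the dictionary (for example BeautifulSoup's NavigableString, which keeps references into the parse tree), is to convert every value to a built-in string before shelving it. The sketch below is written for Python 3 and is not taken from the question's code; on Python 2.7 the analogous conversion would use unicode(). The names to_plain, shelf_to_csv and FakeParsedString are hypothetical helpers made up for illustration, and the CSV export at the end is only one possible way to get the records into R.

```python
import csv
import os
import shelve
import tempfile


def to_plain(value):
    """Recursively replace str subclasses (and containers of them) with built-ins."""
    if isinstance(value, str):
        return str(value)  # str() of a subclass instance yields an exact str
    if isinstance(value, dict):
        return {to_plain(k): to_plain(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_plain(v) for v in value]
    return value


def shelf_to_csv(shelf, csv_path, fieldnames):
    """Write one CSV row per shelved record, keeping only the given columns."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for key in sorted(shelf.keys()):
            writer.writerow(shelf[key])


if __name__ == "__main__":
    class FakeParsedString(str):
        """Hypothetical stand-in for a parser string that carries extra references."""

    address = FakeParsedString("12 Main St")
    address.parent = [address]  # back-reference, loosely like a parse-tree link

    with tempfile.TemporaryDirectory() as workdir:
        with shelve.open(os.path.join(workdir, "url_dict")) as url_dict:
            # Store only plain built-in types so pickle never sees the subclass.
            url_dict["http://example.com/1"] = to_plain(
                {"url": "http://example.com/1", "address": address}
            )
            shelf_to_csv(url_dict, os.path.join(workdir, "out.csv"), ["url", "address"])
```

In the question's get_information(), the same idea would mean wrapping each extracted value in a string conversion before putting it into information_set, so that nothing stored in url_dict refers back to the soup object.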

Infinite recursion when using the shelve module in Python
