Question

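I'm trying to get the following program to work. It is supposed to find e-mail addresses on a website, but it is breaking. I suspect the problem is with initializing result = [] inside the crawl function. Here is the code: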

# -*- coding: utf-8 -*-
import requests
import re
import urlparse

# In this example we're trying to collect e-mail addresses from a website

# Basic e-mail regexp:
# letter/number/dot/comma @ letter/number/dot/comma . letter/number
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')

# HTML <a> regexp
# Matches href="" attribute
link_re = re.compile(r'href="(.*?)"')

def crawl(url, maxlevel):
    result = []
    # Limit the recursion, we're not downloading the whole Internet
    if(maxlevel == 0):
        return

    # Get the webpage
    req = requests.get(url)
    # Check if successful
    if(req.status_code != 200):
        return []

    # Find and follow all the links
    links = link_re.findall(req.text)
    for link in links:
        # Get an absolute URL for a link
        link = urlparse.urljoin(url, link)
        result += crawl(link, maxlevel - 1)

    # Find all emails on current page
    result += email_re.findall(req.text)
    return result

emails = crawl('http://ccs.neu.edu', 2)

print "Scrapped e-mail addresses:"
for e in emails:
    print e

The error I get is as follows:

C:\Python27\python.exe "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py"
Traceback (most recent call last):
  File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 41, in <module>
    emails = crawl('http://ccs.neu.edu', 2)
  File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
    result += crawl(link, maxlevel - 1)
  File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
    result += crawl(link, maxlevel - 1)
TypeError: 'NoneType' object is not iterable

Process finished with exit code 1

Any suggestions would help. Thanks!

Answer 1

The problem is here:

if(maxlevel == 0):
    return

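Currently, the bare return yields None when maxlevel == 0. The caller's result += crawl(link, maxlevel - 1) then tries to extend a list with None, which raises the TypeError, because you cannot concatenate a list with a None object. Return an empty list [] there instead, so that every path through crawl returns a list.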
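
As a rough illustration of that fix, here is a minimal sketch in which every exit path of crawl returns a list. It is written for Python 3 (where urljoin lives in urllib.parse) rather than the Python 2.7 of the question, and it makes one assumption that is not in the original code: the page download is passed in as a hypothetical fetch callable, so the example can run against a small in-memory "site" instead of the network. The example.test URLs and page contents are made up for the demonstration.

```python
import re
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

# Same regular expressions as in the question
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')
link_re = re.compile(r'href="(.*?)"')


def crawl(url, maxlevel, fetch):
    """Return a list of e-mail addresses found at url and on linked pages.

    fetch is a hypothetical callable (an assumption, not part of the
    original code) that takes a URL and returns the page text, or None
    if the page could not be retrieved.
    """
    # The fix: return an empty list, not None, when the depth limit is hit,
    # so that `result += crawl(...)` in the caller always receives a list.
    if maxlevel == 0:
        return []

    text = fetch(url)
    if text is None:
        return []

    result = []
    # Follow every link, resolving relative links against the current URL
    for link in link_re.findall(text):
        result += crawl(urljoin(url, link), maxlevel - 1, fetch)

    # Collect the e-mail addresses on the current page
    result += email_re.findall(text)
    return result


# Made-up in-memory pages standing in for real HTTP responses
pages = {
    'http://example.test/': '<a href="/about">About</a> contact: root@example.test',
    'http://example.test/about': '<a href="/deep">More</a> write to info@example.test',
    'http://example.test/deep': 'hidden@example.test',
}

emails = crawl('http://example.test/', 2, pages.get)
print(emails)
```

With a depth of 2 the crawl reads the start page and the /about page but stops before /deep, since that call reaches maxlevel == 0 and now returns [] rather than None. To use it against a real site, fetch could wrap requests.get and return req.text on a 200 response and None otherwise.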

TypeError: 'NoneType' object is not iterable: Webcrawler to scrape e-mail addresses
