Question

root@bt:~# ./phemail.py -g0@*******.com
Gathering emails from domain: ******.com
Traceback (most recent call last):
  File "./phemail.py", line 206, in <module>
  gatherEmails(domain[0],domain[1],p)
  File "./phemail.py", line 51, in gatherEmails
  namesurname = re.sub(' -.*','',a.text.encode('utf8'))
AttributeError: 'NoneType' object has no attribute 'encode'

Why is a.text of type NoneType?

Answer 1

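a.text has no value (it is None).
The line that initializes the variable a is probably where the problem lies.

By the way, I don't recommend doing things as root.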

Answer 2

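By way of explanation: the script uses Google to search LinkedIn's indexed pages, specifically the pages that display users' names (as opposed to company profiles, jobs, discussions, etc.). Since the target company name (and presumably that company's standard e-mail format) is known and passed in the script's arguments, the search appears to look for all LI profile-page results that mention the company, extract the names, and generate e-mail addresses from those names. It is not scraping e-mail addresses, or even domain names; it is scraping names.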
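
A minimal sketch of that name-to-address step, reusing the three regular expressions from the script's gatherEmails. The result title, the example.com domain, and the firstname.surname@domain join are assumptions for illustration; the quoted part of the script only shows the names being collected.

```python
import re

def split_name(title):
    """Apply the script's regexes to one search-result title."""
    # Drop everything from the first " -" onwards (the headline after the name).
    namesurname = re.sub(' -.*', '', title)
    # Strip " word" runs to keep the first name, "word " runs to keep the surname.
    firstname = re.sub(' ([a-zA-Z])+', '', namesurname).lower()
    surname = re.sub('([a-zA-Z])+ ', '', namesurname).lower()
    return firstname, surname

# Hypothetical result title and domain, not taken from real output.
first, last = split_name('John Smith - United Kingdom | LinkedIn')
email = first + '.' + last + '@example.com'
print(email)
```

Note that the company itself never enters this step: whatever names the search returns are turned into addresses.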

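That actually shows a lack of understanding of how LI makes public profiles visible to search engines (or a tolerance for a lot of garbage results), because your results will be full of "directory" pages rather than profiles.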

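But apart from that strategic error, you are also using the script incorrectly. Google does not support per-character wildcards: a wildcard mainly indicates that one or more words may appear between other words (or before or after them, though it works best in between). Wildcard behavior is a bit tricky, however, and not fully documented for all cases. So even if the script did not fail later, your output would be the first 100 names appearing in a very generic "site:" search of LinkedIn, without any company or domain information. I'm not sure that is useful to anyone.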
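
To see what the search actually asks for, the query string can be rebuilt in isolation. This sketch mirrors the URL expression in the script's gatherEmails; build_url is a name introduced here, and the domains are placeholders.

```python
import re

def build_url(domain, page):
    """Rebuild the search URL the way gatherEmails does for one result page."""
    # Everything from the first dot onwards is stripped, so only the
    # leftmost label of the domain ends up in the query.
    term = re.sub(r'\..*', '', domain)
    return ("http://www.google.co.uk/search?hl=en&safe=off"
            "&q=site:linkedin.com/pub+" + term + "&start=" + str(page) + "0")

print(build_url('example.com', 0))   # query term is "example"
print(build_url('*******.com', 2))   # query term is just the asterisks
```

With p = 10 pages of ten results each, this is where the "first 100 names" come from.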

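As for why the script fails on that particular line: you are iterating over the output of a BeautifulSoup findAll call for the a tags of the search-result items. In this case a.text is None, which causes the error because None has no encode() method. BeautifulSoup has a lot of great shortcuts, but they can be a pain to track down when something goes wrong. The result of findAll is a set of tags, and an unknown attribute on a tag is treated as a search for a child tag of that name, so I think a.text is like calling find('text') on the individual tag in each pass of the loop. I can't say with certainty why that isn't working, since I don't have BeautifulSoup on this machine, but you should be able to play around with this and see where it goes wrong.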

The relevant portion:

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers={'User-Agent':user_agent,}
p = 10

def gatherEmails(l,domain,p):
    print "Gathering emails from domain: "+domain
    emails = []
    for i in range(0,p):
        url = "http://www.google.co.uk/search?hl=en&safe=off&q=site:linkedin.com/pub+"+re.sub('\..*','',domain)+"&start="+str(i)+"0"
        request=urllib2.Request(url,None,headers)
        response = urllib2.urlopen(request)
        data = response.read()
        html = BeautifulSoup(data)
        for a in html.findAll('a',attrs={'class':'l'}):
            namesurname = re.sub(' -.*','',a.text.encode('utf8'))
            firstname = re.sub(' ([a-zA-Z])+','',namesurname).lower()
            surname = re.sub('([a-zA-Z])+ ','',namesurname).lower()
            sys.stdout.write("\r%d%%" %((100*(i+1))/p))
            sys.stdout.flush()
            if firstname != surname and not re.search('\W',firstname) and not re.search('\W',surname):                
                if l == '0' : # 1- firstname.surname@example.com
                    emails.append(firstname+" "+surname)
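
The failure on the a.text line can be imitated without Beautiful Soup installed. The sketch below is a toy stand-in, not the library's code: FakeTag is a hypothetical class whose unknown attributes are treated as a child-tag lookup, on the assumption that this is how old Beautiful Soup 3 tags resolve a.text.

```python
class FakeTag(object):
    """Toy stand-in for an old-style tag: unknown attributes look up a child tag."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first child tag with this name, or None if there is none.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails, so "a.text"
        # silently becomes "a.find('text')" instead of raising.
        return self.find(attr)

a = FakeTag('a', [FakeTag('b')])
print(a.text)   # None: the link has no <text> child

try:
    a.text.encode('utf8')
except AttributeError as err:
    print(err)  # same kind of error as in the traceback
```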

Answer 3

You are using a version of Beautiful Soup older than 3.0.8. Upgrade to get .text, .getText(separator), and (in Beautiful Soup 4) .get_text(separator).
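
If upgrading is not an option, the text can be collected without relying on .text. A minimal sketch, not the library's own recommendation: tag_text is a helper name introduced here. It prefers get_text or getText when the tag really provides them, and otherwise joins the strings returned by findAll(text=True). The StubTag class exists only so the example runs without Beautiful Soup.

```python
def tag_text(tag):
    """Return the text inside a tag, whichever Beautiful Soup version made it."""
    for method_name in ('get_text', 'getText'):
        method = getattr(tag, method_name, None)
        # Old tags answer unknown attributes with None, so check callability.
        if callable(method):
            return method()
    # Fallback: collect every text node under the tag and join them.
    return u''.join(tag.findAll(text=True))

class StubTag(object):
    """Hypothetical stand-in that only offers findAll, like a very old tag."""
    def __init__(self, strings):
        self.strings = strings

    def findAll(self, text=False):
        return list(self.strings) if text else []

print(tag_text(StubTag([u'John Smith', u' - United Kingdom'])))
```

In the script, the loop line would then read tag_text(a).encode('utf8') in place of a.text.encode('utf8').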

phemail.py - no attribute 'encode'
