I found a program called "Best Email Extractor" at http://www.emailextractor.net/. The site says it is written in Python. I tried to write a similar program. The program above extracts about 300 to 1,000 email addresses per minute; mine extracts about 30 to 100 per hour. Could someone give me some tips on how to improve my program's performance? I wrote the following:
import sqlite3 as sql
import urllib2
import re
import lxml.html as lxml
import time
import threading

def getUrls(start):
    urls = []
    try:
        dom = lxml.parse(start).getroot()
        dom.make_links_absolute()
        for url in dom.iterlinks():
            if not '.jpg' in url[2]:
                if not '.JPG' in url[2]:
                    if not '.ico' in url[2]:
                        if not '.png' in url[2]:
                            if not '.jpeg' in url[2]:
                                if not '.gif' in url[2]:
                                    if not 'youtube.com' in url[2]:
                                        urls.append(url[2])
    except:
        pass
    return urls

def getURLContent(urlAdresse):
    try:
        url = urllib2.urlopen(urlAdresse)
        text = url.read()
        url.close()
        return text
    except:
        return '<html></html>'

def harvestEmail(url):
    text = getURLContent(url)
    emails = re.findall('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', text)
    if emails:
        if saveEmail(emails[0]) == 1:
            print emails[0]

def saveUrl(url):
    connection = sql.connect('url.db')
    url = (url, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM urladressen WHERE adresse = ?', url)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO urladressen VALUES(NULL, ?)', url)
            return 1
    return 0

def saveEmail(email):
    connection = sql.connect('emails.db')
    email = (email, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM addresse WHERE email = ?', email)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO addresse VALUES(NULL, ?)', email)
            return 1
    return 0

def searchrun(urls):
    for url in urls:
        if saveUrl(url) == 1:
            #time.sleep(0.6)
            harvestEmail(url)
            print url
            urls.remove(url)
            urls = urls + getUrls(url)

urls1 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=DVD')
urls2 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Jolie')
urls3 = getUrls('http://www.finanzen.net')
urls4 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Party')
urls5 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Games')
urls6 = getUrls('http://www.spiegel.de')
urls7 = getUrls('http://www.kicker.de/')
urls8 = getUrls('http://www.chessbase.com')
urls9 = getUrls('http://www.nba.com')
urls10 = getUrls('http://www.nfl.com')

try:
    threads = []
    urls = (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8, urls9, urls10)
    for urlList in urls:
        # Keep the Thread object itself (Thread.start() returns None),
        # so it can be joined later.
        thread = threading.Thread(target=searchrun, args=(urlList, ))
        thread.start()
        threads.append(thread)
        print threading.activeCount()
    for thread in threads:
        thread.join()
except RuntimeError:
    print RuntimeError
Answer 0 (score: 3)

I don't think many people are going to help you harvest email addresses; it is a widely despised activity.

Regarding performance bottlenecks in your code, you need to find out where the time goes by profiling. At the lowest level, replace each function with a dummy that does no processing but returns valid output; so the email harvester could return a list of the same address 100 times (or however many URLs there are in those results). That will tell you which function is costing you the time.
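As an illustration of that stubbing approach, here is a minimal sketch; the `timed` wrapper and the `getURLContent_stub` stand-in are hypothetical helpers, not part of the original program:

```python
import time

def timed(fn):
    # Wrap a function so the total time spent in it is accumulated
    # across all calls; inspect wrapper.total after a run.
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.total += time.time() - start
    wrapper.total = 0.0
    return wrapper

# A dummy stand-in for getURLContent: no network access, just canned
# HTML. Running the crawl once with the real function and once with
# this stub shows how much of the wall-clock time goes to downloading.
def getURLContent_stub(urlAdresse):
    return '<html><body>mail me at test@example.com</body></html>'
```

Wrapping `harvestEmail`, `saveUrl`, and `saveEmail` the same way pinpoints which stage dominates.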
Things that stand out:

- Anchor the email regular expression with \b word boundaries so that matching does less backtracking.
- Put the ignored file extensions into a set or even a frozenset, e.g. ignoredExtensions = set(['.jpg', '.png', '.gif']), extract the final segment of the URL, and check it against those values in a single lookup instead of a chain of nested ifs. Note also that lowercasing the extension first means less checking and less work, whether it is jpg or JPG.