I'm using Cython on a fairly large for loop, over a million iterations. Run as a regular Python program, it takes about 40 minutes.
Here is vetdns.pyx, with the cdef variables declared just inside the function:
now = datetime.datetime.now()
today = now.strftime("%Y-%m-%d")
my_date = date.today()
dayoftheweek=calendar.day_name[my_date.weekday()]
#needed because of the weird naming and time objects vs datetime objects
read_date = datetime.datetime.strptime(today, '%Y-%m-%d')
previous_day = read_date - datetime.timedelta(days=1)
yesterday = previous_day.strftime('%Y-%m-%d')
my_dir = os.getcwd()
# extracted = "extracted_"+today
outname = "alexa_all_vetted"+today
downloaded_file = "top-1m"+today+".zip"
INPUT_FILE="dns-all"
OUTPUT_FILE="dns_blacklist_"+dayoftheweek
REMOVE_FILE="dns_blacklist_upto"+yesterday
PATH = "/home/money/Documents/hybrid"
FULL_FILENAME= os.path.join(PATH, OUTPUT_FILE)
CLEANUP_FILENAME=os.path.join(PATH, REMOVE_FILE)
##cdef outname, INPUT_FILE, OUTPUT_FILE labeled just inside function.
def main():
    zip_file_url = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"
    urllib.urlretrieve(zip_file_url, downloaded_file)
    ###naming variables affected in for loop
    cdef outname, INPUT_FILE, OUTPUT_FILE
    with zipfile.ZipFile(downloaded_file) as zip_file:
        for member in zip_file.namelist():
            filename = os.path.basename(member)
            # skip directories
            if not filename:
                continue
            # copy file (taken from zipfile's extract)
            source = zip_file.open(member)
            target = file(os.path.join(my_dir, filename), "wb")
            with source, target:
                shutil.copyfileobj(source, target)
        whitelist = open(outname,'w')
        with open(member,'r') as member:
            reader = csv.reader(member, delimiter=',')
            alexia_hosts = []
            for row in reader:
                alexia_hosts.append(row[1])
        whitelist.write("\n".join(alexia_hosts))
        file_out=open(FULL_FILENAME,"w")
        with open(INPUT_FILE, 'r') as dnsMISP:
            with open(outname, 'r') as f:
                alexa=[]
                alexafull=[]
                blacklist = []
                for line in f:
                    line = line.strip()
                    alexahostname=urltools.extract(line)
                    alexa.append(alexahostname[4])
                    alexafull.append(line)
                for line in dnsMISP:
                    line = line.strip()
                    hostname = urltools.extract(line)
                    # print hostname[4]
                    if hostname[4] in alexa:
                        print hostname[4]+",this hostname is in alexa"
                        pass
                    elif hostname[5] in alexafull:
                        print hostname[5]+",this hostname is in alexafull"
                    else:
                        blacklist.append(line)
        file_out.write("\n".join(blacklist))
        file_out.close()
main()
I built this setup.py:
from distutils.core import setup
from Cython.Build import cythonize
setup(
    ext_modules = cythonize("vetdns.pyx")
)
But when I run
python setup.py build_ext --inplace
I get the following error:
Error compiling Cython file:
------------------------------------------------------------
...
            source = zip_file.open(member)
            target = file(os.path.join(my_dir, filename), "wb")
            with source, target:
                shutil.copyfileobj(source, target)
        whitelist = open(outname,'w')
                         ^
------------------------------------------------------------
vetdns.pyx:73:25: local variable 'outname' referenced before assignment
This may be a bit over my head right now, but I wanted to play with it anyway.
Answer 0 (score: 2)
You declare outname as a local variable on this line:
cdef outname, INPUT_FILE, OUTPUT_FILE
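But you never assign anything to it. Python requires a variable to be assigned before it is used; there is no default value that it gets initialized to.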
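I see you have a global variable named "outname". If you want to use the global, you don't need the cdef in your function. The same applies to your other global variables.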
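As a rough plain-Python analogue of the same scoping rule (not Cython itself, and written for Python 3, whereas the question's code is Python 2): once a name is treated as local inside a function, reading it before it has a value fails, even if a global of the same name exists. The functions `shadowed` and `uses_global` are hypothetical names made up for this sketch.

```python
outname = "alexa_all_vetted"  # module-level (global) name, as in the question


def shadowed():
    # The assignment on the second line makes outname local to the whole
    # function, so the read on the first line finds no value yet.
    prefix = outname
    outname = prefix + "_x"
    return outname


def uses_global():
    # No local declaration or assignment: the lookup falls through to the global.
    return outname + "_x"


try:
    shadowed()
    result = "no error"
except UnboundLocalError as exc:
    result = type(exc).__name__

print(result)
print(uses_global())
```

In Cython the `cdef outname` line plays the role of the local assignment here: it makes the name local without giving it a value. Dropping that line lets the function see the module-level variable.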
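One thing you can try, which has worked well for me, is to pull the loop out into its own cythonized function. That way there is less Cython code to debug and optimize, and when most of the processing time is spent in a few lines of code (as is often the case), compiling just those lines can make a big difference. In practice it looks something like this: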
# my_script.py
import os
from my_helper import bottle_neck
def main():
    a = 12
    b = 22
    c = 999
    # More prep code
    print bottle_neck(a, b, c)
main()
And in another file:
# my_helper.pyx
def bottle_neck(int a, int b, int c):
    # Silly example, this loop might never finish
    while a > 0:
        a = a & b
        b = a - c
        c = b * a
    return a, b, c
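Make sure you profile your code first. It is frustrating to find out that the part you assumed was slow is actually pretty fast only after you have spent the time optimizing it.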
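One way to do that profiling is the standard library's cProfile module. The sketch below is written for Python 3 (the question's code is Python 2). `slow_lookup` is a hypothetical stand-in for the hot loop and `profile_call` is a helper made up for this example; neither comes from the question's code.

```python
import cProfile
import io
import pstats


def slow_lookup(n):
    # Hypothetical stand-in for a hot loop: membership tests against a list
    # scan the list linearly on every iteration.
    haystack = list(range(n))
    return sum(1 for i in range(n) if i in haystack)


def profile_call(func, *args):
    # Run func under the profiler and return its result together with a
    # text report of the five most expensive entries by cumulative time.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()


result, report = profile_call(slow_lookup, 500)
print(report)
```

The function that dominates the cumulative-time column is the one worth moving into a .pyx file. If nothing stands out, compiling is unlikely to help much.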