Question

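I have a file with 40 million entries of the form:

#No Username

I also have a list with 6 million items, where each item is a username.

I want to find the common usernames in the fastest way possible. Here is what I have so far: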

import os
usernames=[]
common=open('/path/to/filf','w')
f=open('/path/to/6 million','r')
for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)
#noOfUsers=len(usernames)
for l in f:
    l=l.split(' ')
    if(l[1] in usernames):
        common.write(l[1]+'\n')
common.close()
f.close()

How can I improve the performance of this code?

Answer 1

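I see two obvious improvements: first, make usernames a set, so that each membership test is a hash lookup instead of a scan over the whole list. Then, build a result list and write '\n'.join(resultlist) to the file once.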

import os

usernames = []

for l in os.listdir('/path/to/directory/with/usernames/'):
    usernames.append(l)

usernames = set(usernames)

f = open('/path/to/6 million','r')
resultlist = [] 
for l in f:
    l = l.split(' ')
    if (l[1] in usernames):
        resultlist.append(l[1])
f.close()

common=open('/path/to/filf','w')
common.write('\n'.join(resultlist) + '\n')
common.close()

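Edit: Assuming all you want is to find the most common names:

import os
from collections import Counter

usernames = set(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
# Test the username field (second column) for membership, not the whole line.
names = (line.split()[1] for line in f)
name_counts = Counter(name for name in names if name in usernames)
f.close()
print(name_counts.most_common())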

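Edit2: Given the clarification, here is how to create a file containing the names that appear both in the directory listing and in the 6-million-line file: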

import os
usernames = set(os.listdir('/path/to/directory/with/usernames/'))

f = open('/path/to/6 million')
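# Compare the username field (second column), not the second character of the line.
names = (line.split()[1] for line in f)
resultlist = [name for name in names if name in usernames]
f.close()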

common = open('/path/to/filf','w')
common.write('\n'.join(resultlist) + '\n')
common.close()

Answer 2

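If you create a dict with the usernames as keys, testing whether a key exists in the dict is much faster than testing whether an element exists in a list: a dict lookup is a hash lookup that takes roughly constant time on average, while a list has to be scanned element by element.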
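
To make the difference concrete, here is a rough sketch that times the same membership test against a list, a dict, and a set. The sample names and sizes are made up for illustration and are not the real data; actual timings will vary by machine.

```python
import timeit

# Hypothetical sample data standing in for the real usernames.
names = ['user%d' % i for i in range(100000)]
as_list = names
as_dict = dict.fromkeys(names)  # usernames as keys, values unused
as_set = set(names)

probe = 'user99999'  # last element: worst case for the list scan

# Run the same membership test 200 times against each container.
list_time = timeit.timeit(lambda: probe in as_list, number=200)
dict_time = timeit.timeit(lambda: probe in as_dict, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)

print('list: %.6fs  dict: %.6fs  set: %.6fs' % (list_time, dict_time, set_time))
```

Since only the keys matter here, a set should behave the same as the dict for this purpose and avoids storing unused values.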

Answer 3

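If this is an operation you will perform more than once, may I suggest threading? The following is some pseudocode.

First, split the file into 100,000-line files in Linux: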

> split -l 100000 usernames.txt usernames_

Then, spawn some threads to process the pieces in parallel.

 import threading
 usernames_one = set()
 usernames_two = set()
 filereaders = []

 # Define this class, which puts all the lines in the file into a set
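 class Filereader(threading.Thread):
     def __init__(self, filename, username_set):
         threading.Thread.__init__(self)
         self.filename = filename
         self.username_set = username_set

     def run(self):
         # read each line from filename, put it in username_set
         with open(self.filename) as infile:
             for line in infile:
                 self.username_set.add(line.strip())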

 # loop through possible usernames_ files, and spawn a thread for each:
 # for.....
 f = Filereader('usernames_aa', usernames_one)
 filereaders.append(f)
 f.start()
 # do the same loop for usernames_two

 # at the end, wait for all threads to complete
 for f in filereaders:
     f.join()

 # then do simple set intersection:
 common_usernames = usernames_one & usernames_two

 # then write common set to a file:
 with open("common_usernames.txt", 'w') as common_file:
     common_file.write('\n'.join(common_usernames))

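You will have to check whether adding to a set is thread-safe. If it is not, you can of course create a list of sets (one for each file a thread processes) and union them all at the end, before doing the intersection.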
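
The list-of-sets fallback could look roughly like the sketch below. The helper names read_into and load_usernames are hypothetical, and the sketch assumes each chunk file holds one username per line.

```python
import threading

def read_into(filename, target_set):
    # Each thread fills only its own set, so no set is shared between threads.
    with open(filename) as f:
        for line in f:
            target_set.add(line.strip())

def load_usernames(filenames):
    # One private set per chunk file (hypothetical helper).
    per_file_sets = [set() for _ in filenames]
    threads = [threading.Thread(target=read_into, args=(name, s))
               for name, s in zip(filenames, per_file_sets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Union the per-file sets only after all threads have finished.
    return set().union(*per_file_sets)
```

With two lists of chunk file names, the common usernames would then be load_usernames(chunks_one) & load_usernames(chunks_two). Whether the threads actually speed anything up depends on how much of the time is spent waiting on disk rather than parsing lines, so it is worth measuring first.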

Comparing 40 million lines in a file with 6 million list items in Python
