Question

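Problem:

I am working on a data analysis project that requires me to compare the substrings of an unknown word against a corpus of good and bad words.

I initially generated 4 lists and stored them on disk in pickle format.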

-rw-rw-r-- 1 malware_corpus malware_corpus 189M May  4 13:11 clean_a.pkl
-rw-rw-r-- 1 malware_corpus malware_corpus 183M May  4 13:12 clean_b.pkl
-rw-rw-r-- 1 malware_corpus malware_corpus 1.7M Apr 30 11:12 data_backup.csv
-rw-rw-r-- 1 malware_corpus malware_corpus 2.9M May  4 13:13 data.csv
-rw-rw-r-- 1 malware_corpus malware_corpus 231M May  4 13:12 mal_a.pkl
-rw-rw-r-- 1 malware_corpus malware_corpus 232M May  4 13:13 mal_b.pkl

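So in my code, whenever a new string arrives, I load these 4 lists, compare its substrings against them, and compute a score. Because all 4 lists are held in memory, the program is slow.

Also, each list contains millions of words, so a search takes even longer, since each lookup takes O(n) time.

Solutions needed:

An efficient way to store the 4 lists so that they do not bloat my memory.
A better way to search for strings in the 4 lists.
How to access large lists in Python.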

Code:

def create_corpus(self):
    """corpus

    :param domain: Domain passed will be split and words are stored in
    corpus.
    """
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)),'utils/x.txt'),'r') as f:
        for line in f:
            line = line.rstrip()

            self.line_x = self.calculate_x(line)
            for i in self.line_x:
                self.clean_xs.append(i)
            self.line_y = self.calculate_y(line)
            for i in self.line_y:
                self.clean_ys.append(i)
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)),'utils/y.txt'),'r') as f:
        for line in f:
            line = line.rstrip()
            self.line_x = self.calculate_x(line)
            for i in self.line_x:
                self.mal_xs.append(i)
            self.line_y = self.calculate_y(line)
            for i in self.line_y:
                self.mal_ys.append(i)

    # Store the Datasets in pickle Formats
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)),\
                           'utils/clean_x.pkl'),'wb') as f:
        pickle.dump(self.clean_xs , f)

    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)),\
                           'utils/clean_ys.pkl'),'wb') as f:
        pickle.dump(self.clean_ys , f)
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)),\
                           'utils/mal_xs.pkl'),'wb') as f:
        pickle.dump(self.mal_xs , f)
    with open(os.path.join(os.path.dirname(os.path.realpath(__file__)),\
                           'utils/mal_ys.pkl'),'wb') as f:
        pickle.dump(self.mal_ys, f)
    return 1


def compare_score_function(self,domain):
    self.domain = domain
    self.g_freq = {}
    self.b_score = 0.0
    from collections import Counter
    for x in self.substrings_of_domain:
        self.g_freq[x] = {}
        self.g_freq[x]['occur'] = self.clean_x.count(x)
        self.g_freq[x]['freq']  = self.clean_x.count(x)/len(self.clean_x)
    for key,value in self.g_freq.iteritems():
        self.b_score += value['freq']
    return self.b_score

def calculate_x(self,domain):
    self.domain = self.clean_url(domain)
    self.bgrm = list(ngrams(self.domain,2))
    self.bgrm = [''.join(a) for a in self.bgrm ]
    return self.bgrm

def calculate_y(self,domain):
    self.domain = self.clean_url(domain)
    self.tgrm = list(ngrams(self.domain,3))
    self.tgrm = [''.join(a) for a in self.tgrm]
    return self.tgrm

Example

clean_x_list = ['ap', 'pp', 'pl', 'le', 'bo', 'xl', 'ap']
clean_y_list = ['apa', 'ppa', 'fpl', 'lef', 'bfo', 'xdl', 'mpd']
bad_x_list = ['ti', 'qw', 'zx', 'qa', 'qa', 'qa', 'uy']
bad_y_list = ['zzx', 'zxx', 'qww', 'qww', 'qww', 'uyx', 'uyx']

Assume these are my 4 lists.

My new string comes in, say apple. Now I calculate the x words for apple => ['ap', 'pp', 'pl', 'le'], and the y words for apple => ['app', 'ppl', 'ple', 'lea'].

Now I search for each x word of apple, i.e. ['ap', 'pp', 'pl', 'le'], in clean_x_list and bad_x_list,
then I calculate the occurrences and the frequency. For 'ap', for example:
occurrences in clean_x_list = 2
frequency in clean_x_list = 2/7
occurrences in bad_x_list = 0
frequency in bad_x_list = 0/7

Similarly, I calculate the occurrences and frequencies of the other words and finally sum them up.

Answer 1

Consider sorting your lists and searching them with bisect. In that case, the worst-case lookup time is O(log n).
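A minimal sketch of that suggestion, assuming each corpus fits in memory as a sorted list of strings. The helper names contains and count_occurrences are hypothetical, not part of the question's code. Since equal items end up next to each other after sorting, the gap between bisect_left and bisect_right can also serve as an occurrence count.

```python
import bisect

def contains(sorted_words, word):
    """Return True if word is in the sorted list (binary search, O(log n))."""
    i = bisect.bisect_left(sorted_words, word)
    # Guard against i == len(sorted_words) when word sorts after everything.
    return i < len(sorted_words) and sorted_words[i] == word

def count_occurrences(sorted_words, word):
    """Count how often word appears; duplicates are adjacent after sorting."""
    left = bisect.bisect_left(sorted_words, word)
    right = bisect.bisect_right(sorted_words, word)
    return right - left

# Sample data taken from the question's example list.
clean_x_sorted = sorted(['ap', 'pp', 'pl', 'le', 'bo', 'xl', 'ap'])
print(contains(clean_x_sorted, 'pl'))           # True
print(count_occurrences(clean_x_sorted, 'ap'))  # 2
print(count_occurrences(clean_x_sorted, 'zz'))  # 0
```

The list only has to be sorted once, for instance before it is pickled, so every later lookup pays just the O(log n) search.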

Answer 2

There are basically three options: a linear scan of the list in O(n), ...

>>> lst = random.sample(range(1, 1000000), 100000)
>>> x = lst[50000]
>>> %timeit x in lst
100 loops, best of 3: 2.12 ms per loop

... a binary search in a sorted list in O(log n) using the bisect module, ...

>>> srt = sorted(lst)
>>> srt[bisect.bisect_left(srt, x)] == x
True
>>> %timeit srt[bisect.bisect_left(srt, x)] == x
1000000 loops, best of 3: 444 ns per loop

... and a lookup in a hash set in O(1):

>>> st = set(lst)
>>> %timeit x in st
10000000 loops, best of 3: 38.3 ns per loop

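Clearly, the set is by far the fastest, but it also takes up more memory than the list-based approaches. The bisect approach may be a good compromise: it is about 5,000 times faster than the linear scan in this example and only requires the list to be sorted.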

>>> sys.getsizeof(lst)
800064
>>> sys.getsizeof(srt)
900112
>>> sys.getsizeof(st)
4194528

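However, unless your machine's memory is very limited, this should not be a problem. In particular, it will not make the code slower. Either it all fits in memory and everything is fine, or it does not and your program grinds to a halt.

If your good/bad word lists may contain duplicates, then a set is not an option, and bisect will not work well either. In that case, create a Counter for each list. You can then get the number of occurrences and the frequency of each substring in the text. Since a Counter is a kind of hash map/dictionary, lookups in it are also O(1).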

>>> clean_x_list = ['ap','pp','pl','le','bo','xl','ap']
>>> w = "apple"
>>> wx = [w[i:i+2] for i in range(len(w)-1)]
>>> ccx = collections.Counter(clean_x_list)

>>> occ_wx = {x: ccx[x] for x in wx}
>>> occ_wx
{'ap': 2, 'pp': 1, 'pl': 1, 'le': 1}

>>> freq_wx = {x: ccx[x] / len(clean_x_list) for x in wx}
>>> freq_wx
{'ap': 0.2857142857142857,
 'pp': 0.14285714285714285,
 'pl': 0.14285714285714285,
 'le': 0.14285714285714285}

The same works for clean_y_list, bad_x_list, and so on.
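To tie this back to the score computed in the question, here is a hedged sketch that sums the relative frequencies of a word's bigrams against a clean and a bad corpus. The function names bigrams and frequency_score are hypothetical, and the scoring rule (one contribution per distinct substring) is an assumption based on the dictionary used in the question's compare_score_function.

```python
from collections import Counter

def bigrams(word):
    """Return all overlapping 2-character substrings of word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def frequency_score(substrings, counts, total):
    """Sum the relative frequency of each distinct substring in one corpus."""
    if total == 0:
        return 0.0
    # set() mirrors the question's dict-based loop: each substring counts once.
    return sum(counts[s] / total for s in set(substrings))

clean_x_list = ['ap', 'pp', 'pl', 'le', 'bo', 'xl', 'ap']
bad_x_list = ['ti', 'qw', 'zx', 'qa', 'qa', 'qa', 'uy']

# Build each Counter once, then reuse it for every incoming word.
clean_x_counts = Counter(clean_x_list)
bad_x_counts = Counter(bad_x_list)

wx = bigrams("apple")
clean_score = frequency_score(wx, clean_x_counts, len(clean_x_list))
bad_score = frequency_score(wx, bad_x_counts, len(bad_x_list))
print(clean_score, bad_score)
```

A Counter stores each distinct n-gram only once, so if the lists contain many repeats it is likely to be much smaller than the original list, both in memory and when pickled.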

Answer 3

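One option for saving space is to store the word files in compressed form, while also not reading the whole word file into memory. A simple option for this is gzip.GzipFile, which lets you work with a gzip archive as if it were a regular file: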

import gzip

with gzip.open('input.gz','rt') as text_f:
    for line in text_f:
        line = line.strip()
        print(line)

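This way, you can treat each line in the file as an item in a list and process them accordingly.

Note the rt (or wt) open mode, which handles the file as text rather than binary. Which one you need depends on whether you are storing plain text/JSON or a binary data format (such as pickle).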
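As a rough sketch of how the streaming read could feed the frequency lookup, the example below counts words line by line from a gzip-compressed text file, so the full word list is never held in memory. The file name clean_x.txt.gz, the one-word-per-line layout, and the helper count_lines are assumptions made for illustration.

```python
import gzip
import os
import tempfile
from collections import Counter

def count_lines(path):
    """Stream a gzip-compressed text file and count each non-empty line."""
    counts = Counter()
    with gzip.open(path, 'rt', encoding='utf-8') as text_f:
        for line in text_f:
            word = line.strip()
            if word:
                counts[word] += 1
    return counts

# Write a tiny sample file so the sketch is runnable end to end.
path = os.path.join(tempfile.mkdtemp(), 'clean_x.txt.gz')
with gzip.open(path, 'wt', encoding='utf-8') as text_f:
    text_f.write('\n'.join(['ap', 'pp', 'pl', 'le', 'bo', 'xl', 'ap']) + '\n')

counts = count_lines(path)
print(counts['ap'], sum(counts.values()))  # 2 7
```

Only the distinct words and their counts stay in memory after the pass, which is usually far less than the raw list when the corpus has many repeats.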

Python: How to search a large array efficiently?

3 answers: