Question

我有一个分为单词的名字（字符串）列表。有800万个名字，每个名字最多包含20个单词（代币）。唯一令牌的数量是220万。我需要一种有效的方法来查找包含查询中至少一个单词的所有名称（最多可包含20个单词，但通常只包含几个单词）。

我目前的方法使用Python Pandas，看起来像这样（后面称为original）：

>>> df = pd.DataFrame([['foo', 'bar', 'joe'], 
                       ['foo'], 
                       ['bar', 'joe'], 
                       ['zoo']], 
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None

def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results    
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None

目前这种查找（在完整数据集上）在我的（相当强大的）机器上需要5.75秒。我想至少加速它，比如10次。

通过将所有列压缩到一个并对其执行查找（后面称为original, squeezed），我能够达到5.29：

>>> df = pd.Series([{'foo', 'bar', 'joe'}, 
                    {'foo'}, 
                    {'bar', 'joe'}, 
                    {'zoo'}], 
                    index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df

id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object

def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])

id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object

但那还不够快。

另一个似乎易于实现的解决方案是使用Python多处理（由于GIL并且没有I / O，因此线程不应该提供帮助，对吧？）。但问题是需要将大数据帧复制到每个进程，这会占用所有内存。另一个问题是我需要在循环中多次调用filter_by_tokens，因此它会在每次调用时复制数据帧，这是低效的。

请注意，单词可能会在名称中出现多次（例如，最常用的单词在名称中出现600k次），因此反向索引会很大。

有效写这个的好方法是什么？ Python解决方案更受欢迎，但我也对其他语言和技术（例如数据库）开放。

UPD： 我已经测量了我的两个解决方案的执行时间以及@piRSquared在answer中建议的5个解决方案。结果如下（tl; dr最好是2倍的改进）：

+--------------------+----------------+
|       method       | best of 3, sec |
+--------------------+----------------+
| original           | 5.75           |
| original, squeezed | 5.29           |
| zip                | 2.54           |
| merge              | 8.87           |
| mul+any            | MemoryError    |
| isin               | IndexingError  |
| query              | 3.7            |
+--------------------+----------------+

mul+any在d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()（在128Gb RAM计算机上）上发出MemoryError。

isin在IndexingError: Unalignable boolean Series key provided上提供s[d1.isin({'zoo', 'foo'}).unstack().any(1)]，显然是因为df.stack().isin(set(tokens)).unstack()的形状略小于原始数据框的形状（8.39M对8.41M行），而不是确定为什么以及如何解决这个问题。

请注意，我使用的机器有12个核心（虽然我上面提到了并行化的一些问题）。所有解决方案都使用单核。

结论（截至目前）：zip（2.54s）与original squeezed解决方案（5.29s）相比提高了2.1倍。它很好，但如果可能的话，我的目标是至少提高10倍。所以我现在暂不接受（仍然很棒）@piRSquared的答案，以欢迎更多的建议。

Answer 1

idea 0
zip

def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})

想法1
merge

token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]

想法2
mul + any

d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]

想法3
isin

d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]

idea 4
query

token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]

Answer 2

我使用以下工具做过类似的事情

Hbase - Key can have Multiple columns (Very Fast)
ElasticSearch - Nice easy to scale. You just need to import your data as JSON

Apache Lucene - 对于800万条记录非常好

Answer 3

你可以用反向索引来做;下面的代码在pypy中运行，在57秒内构建索引，查询或20个单词占用0.00018秒并使用大约3.2Gb内存。 Python 2.7在158秒内构建索引，并使用大约3.41Gb内存在0.0013秒内进行查询。

最快的方法是使用位图反转索引，压缩以节省空间。

"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time   records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
    for x in range(random.randint(1,20)):
        wordno = random.randint(0,wordlimit)
        try:
           wordlists[wordno].add(recno)
        except:
           wordlists[wordno] = set([recno])
        total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
    while 1:
        wordno = (random.randint(0,words))
        if  wordno in wordlists: # only query words that were used
            if  not wordno in query:
                query.add(wordno)
                break
print "query", query
starttime = time.time()
for wordno in query:
    querylist.union(wordlists[wordno])
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
    print "record", recno, "matches"

Answer 4

也许我的第一个答案有点抽象;在没有真实数据的情况下，它产生了随机数据。卷需求。了解查询时间。这段代码很实用。

data =[['foo', 'bar', 'joe'],
       ['foo'],
       ['bar', 'joe'],
       ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
           wordlists[word].add(x)
        except:
           wordlists[word] = set([x])
print "query"
query = [ "foo", "zoo" ]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if  wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]

时间和记忆的成本是建立单词表（倒排索引）;查询几乎是免费的。你有12台核心机器，所以可能它有足够的内存。为了重复性，构建单词列表，挑选每个单词列表并写入sqlite或任何键/值数据库，单词为key，pickled set为二进制blob。那么你所需要的只是：

initialise_database()
query = [ "foo", "zoo" ]
results = set()                             
for q in query:                             
    wordlist = get_wordlist_from_database(q) # get binary blob and unpickle
    if  wordlist:                        
        results = results.union(wordlist)
l = list(results)
l.sort()   
for x in l:      
    print data[x]

或者，使用数组，这样可以提高内存效率，并且可能更快地构建索引。 pypy比2.7

快10倍

import array

data =[['foo', 'bar', 'joe'],
       ['foo'],
       ['bar', 'joe'],
       ['zoo']]

wordlists = {}
print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
           wordlists[word].append(x)
        except:
           wordlists[word] = array.array("i",[x])
print "query"
query = [ "foo", "zoo" ]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if  wordlist:
        for i in wordlist:
            results.add(i)
l = list(results)
l.sort()
for x in l:
    print data[x]

Answer 5

如果您知道您将看到的唯一令牌数量相对较少，你可以很容易地构建一个有效的位掩码来查询匹配。

天真的方法（在原始帖子中）将允许多达64个不同的令牌。

下面改进的代码使用位掩码，就像布隆过滤器一样（模拟算法设置位绕64位）。如果有超过64个唯一令牌，则会出现一些误报，下面的代码会自动验证（使用原始代码）。

如果唯一令牌的数量（大于）大于64，或者如果您特别不走运，那么最坏情况下的性能会降低。哈希可以减轻这一点。

就性能而言，使用下面的基准数据集，我得到：

原始代码： 4.67秒

位掩码： 0.30秒

但是，当增加唯一令牌的数量时，位掩码代码保持有效，而原始代码显着减慢。有大约70个独特的令牌，我得到类似的东西：

原始代码： ~15秒

位掩码： 0.80秒

注意：对于后一种情况，从提供的列表构建位掩码数组需要花费与构建数据帧相同的时间。构建数据帧可能没有真正的理由;留下它主要是为了便于与原始代码进行比较。

class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                    index=pd.Index(
                        data=['id%d' % i for i in range(len(token_lists))],
                        name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))
        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.int64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note that b = dii[t] % 64
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t]
                self.bits[i] |= (1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results    
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
                time.time()-tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.int64)
        verify = np.zeros(len(self.bits), np.int64)
        verification_needed = False
        for t in search_tokens:
            bitmask |= (self.bits & (1<<self.string_to_iid[t]))
            if self.string_to_iid[t] > 64:
                verification_needed = True
                verify |= (self.bits & (1<<self.string_to_iid[t]))
        if verification_needed:
            results = self.df[(bitmask > 0 & ~verify.astype(bool))]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                    self.df[(bitmask > 0 & verify.astype(bool))])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
                time.time()-tic, len(results)))
        return results

制作一些测试数据

unique_token_lists = [
    ['foo', 'bar', 'joe'], 
    ['foo'], 
    ['bar', 'joe'], 
    ['zoo'],
    ['ziz','zaz','zuz'],
    ['joe'],
    ['joey','joe'],
    ['joey','joe','joe','shabadoo']
]
token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)

运行原始代码和位掩码代码

>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask

>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995   zoo None    None    None

>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995   zoo None    None    None

通过常用词

5 个答案: