Question

Hello, I am working with Python 3 and have been stuck on this problem for a while; I can't seem to figure it out.

I have two numpy arrays containing strings:

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

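As you may notice, array_one is an array containing the 1-grams, 2-grams, 3-grams, 4-grams and 5-grams of the sentence alice in a wonder land.

I have intentionally written wonderland as two words, wonder and land.

Now I have another numpy array that contains some locations and names.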

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

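What I want to do is get all the elements of array_one that exist in array_two.

If I take the intersection of the two arrays with np.intersect1d, I get no matches, because wonderland is two separate words in array_one but a single word in array_two.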
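To make the problem concrete, here is a minimal reproduction. It is only a sketch: array_one is shortened to three entries for illustration, and the variable name common is not from the original post.

```python
import numpy as np

# Shortened version of array_one, for illustration only.
array_one = np.array(['wonder', 'land', 'wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# intersect1d compares whole strings exactly, so 'wonder land'
# (with a space) never equals 'wonderland' (without one).
common = np.intersect1d(array_one, array_two)
print(common.size)  # 0
```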

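Is there any way to do this? I have tried the solutions from Stack Overflow (this), but they don't seem to work for array_one.

array_two will have at most 60-100 items, while array_one will have at most about 1 million items, but on average 250,000 to 500,000 items.

Edit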

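I have used a very naive approach, since I have not been able to find a solution so far: I remove the white space from both arrays and then use the resulting boolean array ([True, False, True]) to filter the original array. Here is the code:

import numpy.core.defchararray as np_f
import numpy as np

array_two_wr = np_f.replace(array_two, ' ', '')
array_one_wr = np_f.replace(array_one, ' ', '')
intersections = array_two[np.in1d(array_two_wr, array_one_wr)]

But I am not sure this is the way to go, given the number of elements involved.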
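For comparison, here is a sketch of the same whitespace-stripping idea that returns the matching members of array_one rather than array_two. It assumes np.char.replace and np.isin as the public spellings of the calls used above; the names mask and matches are illustrative only.

```python
import numpy as np

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# Strip spaces so 'wonder land' and 'wonderland' compare equal.
array_one_wr = np.char.replace(array_one, ' ', '')
array_two_wr = np.char.replace(array_two, ' ', '')

# Boolean mask over array_one: True where the stripped form
# also occurs among the stripped members of array_two.
mask = np.isin(array_one_wr, array_two_wr)
matches = array_one[mask]
print(matches)
```

With the sample data this selects only 'wonder land'. How well it behaves at around a million entries has not been measured here.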

Answer 1

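Minhashing could definitely be used here. Here is the general idea behind minhashing: for each object in a list, hash the object multiple times and update an object that keeps track of the hashes computed for each list member. Then examine the set of resulting hashes, and for each hash find all the objects for which that hash was computed (we just stored this data). If the hash function is chosen carefully, the objects for which the same hash was computed will be very similar.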
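The idea can be sketched in plain Python without any library. This is an illustrative toy, not how datasketch is implemented: the function names, the use of salted blake2b as the family of hash functions, and the fallback for strings shorter than k are all assumptions made for the example.

```python
import hashlib

def shingles(s, k=3):
    """Return the set of k-character substrings of s.
    Assumption: strings shorter than k are treated as one shingle."""
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(s, num_perm=128, k=3):
    """For each of num_perm salted hash functions, keep the minimum
    hash value seen over all shingles of s."""
    grams = shingles(s, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, 'little')  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode('utf8'), digest_size=8, salt=salt).digest(),
                'big')
            for g in grams))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree; this
    approximates the Jaccard similarity of the shingle sets."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)

query = minhash_signature('wonder land')
print(estimate_similarity(query, minhash_signature('wonderland')))
print(estimate_similarity(query, minhash_signature('florida')))
```

Strings that share many 3-character shingles agree in many signature positions, while strings that share none agree in essentially none.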

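For a more detailed explanation of minhashing, see Chapter 3 of Mining Massive Datasets.

Here is a sample Python 3 implementation that uses your data and datasketch (pip install datasketch), which computes the hashes: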

import numpy as np
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

def build_minhash(s):
  '''Given a string `s` build and return a minhash for that string'''
  new_minhash = MinHash(num_perm=256)
  # hash each 3-character gram in `s`
  for chargram in ngrams(s, 3):
    new_minhash.update(''.join(chargram).encode('utf8'))
  return new_minhash

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# create a structure that lets us query for similar minhashes
lsh = MinHashLSH(threshold=0.3, num_perm=256)

# loop over the index and value of each member in array two
for idx, i in enumerate(array_two):
  # add the minhash to the lsh index
  lsh.insert(idx, build_minhash(i))

# find the items in array_one with 1+ matches in array_two
for i in array_one:
  result = lsh.query(build_minhash(i))
  if result:
    matches = ', '.join([array_two[j] for j in result])
    print(' *', i, '--', matches)

Results (array_one member on the left, array_two match on the right):

 * wonder -- wonderland
 * a wonder -- wonderland
 * wonder land -- wonderland
 * a wonder land -- wonderland
 * in a wonder land -- wonderland
 * alice in a wonder land -- wonderland

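The easiest way to tune precision/recall here is to change the threshold argument to MinHashLSH. You could also try modifying the hashing technique itself. Here I used 3-character hashes when building the minhash for each ngram, a technique that Yale's Digital Humanities Lab has found to be very powerful at capturing text similarity: https://github.com/YaleDHLab/intertext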
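To get a feel for what the threshold means, the exact Jaccard similarity over 3-character shingles can be computed directly for this small example. This is a sketch that does not use datasketch, and the helper names (shingles, jaccard, matches_at) are made up for illustration. MinHashLSH is approximate, so its output can differ from an exact cut-off like this one.

```python
def shingles(s, k=3):
    """Set of k-character substrings of s (empty if s is shorter than k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of the shingle sets of a and b."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

array_one = ['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
             'a wonder', 'wonder land', 'alice in a', 'in a wonder',
             'a wonder land', 'alice in a wonder', 'in a wonder land',
             'alice in a wonder land']
target = 'wonderland'

def matches_at(threshold):
    """Members of array_one whose similarity to target reaches threshold."""
    return [s for s in array_one if jaccard(s, target) >= threshold]

# A lower threshold admits more (and looser) matches.
for threshold in (0.2, 0.3, 0.5):
    print(threshold, matches_at(threshold))
```

Raising the threshold trades recall for precision: at 0.5 only 'wonder' and 'wonder land' remain, while at 0.2 nine of the fifteen entries pass.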

Answer 2

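Sorry to post two answers, but after adding the locality-sensitive-hashing technique above, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.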

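A bloom filter is a beautiful object that lets you pass in some objects, then query whether a given object has been added to the bloom filter. Here's an awesome visual demo of a bloom filter.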
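As a rough illustration of how such an object works, here is a toy bloom filter in plain Python. It is a sketch only and is not the bloom-filter package: the class name TinyBloomFilter, its parameters, and the choice of salted blake2b hashes are assumptions made for the example.

```python
import hashlib

class TinyBloomFilter:
    """Hypothetical minimal bloom filter, for illustration only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions from differently salted hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode('utf8'), digest_size=8,
                                     salt=seed.to_bytes(8, 'little')).digest()
            yield int.from_bytes(digest, 'big') % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if every position is set. This can be a false positive,
        # but an item that was added is never reported as missing.
        return all(self.bits[pos] for pos in self._positions(item))

places = ['newyork', 'lasvegas', 'wonderland', 'florida']
bloom = TinyBloomFilter()
for place in places:
    bloom.add(place)
print('wonderland' in bloom)  # True: added items are always found
```

The filter stores only bits, never the strings themselves, which is why membership answers can be wrong in one direction (false positives) but not the other.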

In your case, we can add each member of array_two to the bloom filter, then query each member of array_one to see whether it is in the bloom filter. Using pip install bloom-filter:

from bloom_filter import BloomFilter # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Clean a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize bloom filter with particular size
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each member of array_two to bloom filter
for i in array_two:
  bloom.add(clean(i))
# find the members in array_one in array_two
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements, this could be a very efficient (and highly scalable) solution.

Matching strings from one numpy array to another
