Question

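I have a fixed set of 20 words. I also have a large file of 20,000 records, where each record contains a string. I want to find out whether any word from the fixed set is present in the string and, if so, the index of that word.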

For example:

GroundOverlayOptions newarkMap = new GroundOverlayOptions() 
            .position(new LatLng(0, 0), 27000000f, 12735849f)  
            .image(BitmapDescriptorFactory.fromResource(R.drawable.map3));
googleMap.addGroundOverlay(newarkMap);

I would like to know whether there is a better/faster way to do this.

Answer 1

You can use a list comprehension with a double for loop:

s1 = set(["barely", "rarely", "hardly"])

l2 = ["i hardly visit", "i do not visit", "i can barely talk"]

locations = [c for c, b in enumerate(l2) for a in s1 if a in b]

In this example, the output is:

[0, 2]

However, if you want a way to look up the record indices at which a given word occurs:

from collections import defaultdict

d = defaultdict(list)

for word in s1:
    # enumerate() supplies the index of each sentence in l2
    for index, sentence in enumerate(l2):
        if word in sentence:
            d[word].append(index)
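
Note that `word in sentence` is a substring test on strings, so "hardly" would also match inside a longer word. Below is a minimal sketch of a whole-word variant that also records where each word sits inside its record. It assumes whitespace tokenization is good enough for the data; the function name `word_positions` is hypothetical and not part of the answer above.

```python
from collections import defaultdict

def word_positions(records, words):
    """Map each word found to a list of (record_index, token_index) pairs."""
    found = defaultdict(list)
    for rec_idx, record in enumerate(records):
        # Assumption: splitting on whitespace is an acceptable tokenizer.
        for tok_idx, token in enumerate(record.split()):
            if token in words:  # set membership, O(1) on average
                found[token].append((rec_idx, tok_idx))
    return dict(found)

s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]
positions = word_positions(l2, s1)
print(positions)  # {'hardly': [(0, 1)], 'barely': [(2, 2)]}
```

Each record is tokenized once, so the work grows with the total number of tokens rather than with records times words.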

Answer 2

This suggestion only removes some obvious inefficiencies; it does not change the overall complexity of the solution:

import nltk

def find_word(text, s1=s1):  # micro-optimization: make s1 a local name
    tokens = nltk.word_tokenize(text)
    for i, word in enumerate(tokens):
        if word in s1:
            # Do something with `word` and `i`
            pass

Basically, using map slows things down when all you really need is a condition inside the loop body. So just get rid of get_token_index; it is over-engineered.
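
If only the first hit per record is needed, the same "plain condition in the loop" idea can stop early. This is a rough sketch, not the answer's own code: `first_match` and its `tokenize` parameter are hypothetical names, and `str.split` stands in for `nltk.word_tokenize` so the example runs without NLTK (the NLTK tokenizer could be passed in instead).

```python
def first_match(text, words, tokenize=str.split):
    """Return (token_index, word) for the first token found in `words`, else None."""
    # A condition inside a generator expression; next() stops at the first hit.
    return next(
        ((i, tok) for i, tok in enumerate(tokenize(text)) if tok in words),
        None,
    )

s1 = {"barely", "rarely", "hardly"}
print(first_match("i can barely talk", s1))  # (2, 'barely')
print(first_match("i do not visit", s1))     # None
```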

Answer 3

This should work:

for string in l2:
    words = string.split(' ')
    for s in s1:
        if s in words:
            # words.index(s) gives the position of the first occurrence only
            print("%s at index %d" % (s, words.index(s)))

Answer 4

The simplest, and slightly more efficient, approach is to use a Python generator expression:

index_tuple = list(l2.index(i) for i in s1 if i in l2)

You can time it and check whether its efficiency meets your requirements.
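
One way to do that timing is with the standard `timeit` module. The sketch below is only an illustration: the data is synthetic (the three sample sentences repeated), and the helper names `substring_scan` and `token_scan` are made up for the comparison. Actual timings depend on your records, so measure on your own file.

```python
import timeit

s1 = {"barely", "rarely", "hardly"}
# Synthetic data: the three sample sentences repeated 1000 times.
l2 = ["i hardly visit", "i do not visit", "i can barely talk"] * 1000

def substring_scan():
    # Test every word against every record with a substring check.
    return [c for c, b in enumerate(l2) for a in s1 if a in b]

def token_scan():
    # Tokenize each record once, then check for any overlap with the set.
    return [c for c, b in enumerate(l2) if not s1.isdisjoint(b.split())]

for fn in (substring_scan, token_scan):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```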
