Question

I have a very long string, say astr = "I am a very long string and I could contain a lot of text, so think of efficiency here". I also have a list alist = ["I", "am a", "list", "of strings", "and each string", "could be made up of many words", "so think of efficiency here"]. The list of strings has a matching list of integers alist_ofints = [1, 2, 3, 4, 5, 6, 7], which gives the number of points each string in the list is worth.

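I am supposed to write a function that finds how many phrases from alist appear in astr and keeps a "points" counter using the matching points list alist_ofints. In this example, the phrases "I", "am a" and "so think of efficiency here" appear twice, once and once respectively. That gives 1*2 + 2*1 + 7*1 = 11 points.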

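I came up with two naive solutions. The first is a function that goes through the list of strings alist, checks whether each item is in astr and, if so, applies the obvious follow-up logic. This is inefficient because I would scan astr len(alist) times. That is wasteful, isn't it? It is clean, but inefficient.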

The second solution is to turn astr into a list of words and check the words from index i to index j, where i is my current position in the list and j depends on the length of the phrase from alist that I am looking for. So "am a" is a phrase of length 2 (it has two words), and I would look at i = some number, j = some number + 1. If I am looking for the phrase "and each string", then i = some number, j = some number + 3, so I would be looking at three words when testing this phrase. I think this has the same time complexity. Although I am not scanning astr len(alist) times, I am walking my list of words len(alist) times. I would also have to create the list(astr) list, which adds some cost, I suppose.

So far I prefer the first solution, because it is the simplest and cleanest. Is there a better way? Bonus points if you can find a list-comprehension approach...

Thanks

Note: I know list(astr) does not return a list of words. For this example, imagine that it does.

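TL;DR: I have two lists. I need to check whether each element of one list equals an element of the other, and count how often they occur. Is there a more efficient way than checking every element of list 1 against every element of list 2 (which I think is O(n^2))?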

Answer 1

This one-liner seems to do exactly what you want:

print(sum(astr.count(s) * i for s, i in zip(alist, alist_ofints)))

This is much like your first approach, though I find it inefficient.

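One thing to be aware of: str.count(s) only counts the non-overlapping occurrences of the substring s in the string. It matches substrings, not whole words.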
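To make that caveat concrete, here is a small sketch. The helper count_whole_phrase is a hypothetical name, not something from the question, and it assumes that "whole word" means "not directly preceded or followed by a word character".

```python
import re


def count_whole_phrase(text, phrase):
    """Count occurrences of phrase that are not glued to other word characters."""
    # Lookarounds keep "a" from matching inside "am" or "man".
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, text))


# str.count works on substrings, so letters inside other words are counted too.
print("am a man".count("a"))               # 3
print(count_whole_phrase("am a man", "a"))  # 1

# str.count skips overlapping matches.
print("aaaa".count("aa"))                  # 2
```

Whether substring matches should count is up to the asker; for the example in the question both ways of counting give the same result.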

Answer 2

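A more efficient algorithm could index the long string astr with a string index (for example, a suffix array). Then search the index for each entry of alist and increase the points accordingly whenever a match is found.

Building the index for astr takes O(n), where n is the length of astr.

Searching the index for an entry of alist of length m takes O(log n) binary-search steps (each step compares up to m characters).

Overall, you should get away with O(p log n), where p is the number of entries in alist.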

Example

Let's take the long string astr to be


I am a very long string

Then the corresponding suffix array (all lowercased) would be


SA = [4 1 11 16 6 5 2 8 22 15 0 20 12 3 21 14 13 19 9 17 18 7 10]

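These are all the suffixes of astr (represented by their start index), sorted lexicographically. For example, SA[9] = 15 stands for the suffix of astr that starts at position 15 ("g string").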

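Now suppose your list of phrases is


alist = ["I am", "very long", ...]

Then you search the suffix array for occurrences of each entry. This is done with a binary search over the suffix array. For "I am" it would go like this: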

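First look at the middle entry of the suffix array (SA[11] = 20). Then look at the suffix this index stands for ("ing"). Since this suffix is greater than the search phrase "I am", you continue in the left half of the suffix array. Keep going with the binary search until you either find the phrase or can be sure it is not there.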
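A minimal sketch of that lookup, under a few assumptions: the suffix array is built naively by sorting all suffixes (simple, but slower than the O(n) construction mentioned above), and the names build_suffix_array, _bound, count_occurrences and score are made up for illustration. It counts substring occurrences, so a phrase also matches inside longer words.

```python
def build_suffix_array(text):
    """Naive construction: sort the start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, sa, phrase, upper):
    """Binary search for the first suffix whose prefix is >= phrase
    (or > phrase when upper is True)."""
    lo, hi = 0, len(sa)
    m = len(phrase)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < phrase or (upper and prefix == phrase):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text, sa, phrase):
    """All suffixes starting with phrase sit next to each other in sa."""
    return _bound(text, sa, phrase, True) - _bound(text, sa, phrase, False)


def score(text, phrases, points):
    sa = build_suffix_array(text)
    return sum(count_occurrences(text, sa, p) * v
               for p, v in zip(phrases, points))


text = "i am a very long string"
sa = build_suffix_array(text)
print(sa[9], sa[11])                        # 15 20
print(count_occurrences(text, sa, "i am"))  # 1
```

The index only pays off when the same long string is queried for many phrases; for a single pass the one-liner with str.count is probably simpler.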

Answer 3

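You could build a Trie data structure for the list of words, where each end node holds the index into the points array.

From Wikipedia, the trie for the input ["A", "to", "tea", "ted", "ten", "i", "in", "inn"] looks like this: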

[Figure: Trie example.svg, by Booyabazooka (based on a PNG image by Deco), modifications by Superm401. Public domain, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Trie_example.svg]

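So we can walk the whole length of the input string and, whenever we reach an end-of-word node, add its points and carry on.

This way the whole string can be searched in linear time.

But if list items overlap, for example ["ab", "cd", "abcd"] with points [3, 4, 1] and the word abcd, we cannot get a linear-time solution after preprocessing, because after each end-of-word hit the highest score could come from either of these:

Extending the string matched so far and looking further ahead.
Starting to look up the remaining string as a separate word from the list.

Time and space complexity of building the Trie: O(w * m), where w is the number of words and m is the maximum length of a word in the list.

A lookup can be done in O(m), where m is the length of the word being searched.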
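A rough sketch of the trie idea using nested dicts. Assumptions: the end marker END and the function names are hypothetical, the end node stores the points directly instead of an index, and every occurrence is counted (including overlapping ones), so it does not try to pick the highest-scoring split discussed above. Restarting the walk at every position makes the scan O(n * m) rather than strictly linear.

```python
END = "$points"  # hypothetical marker key; longer than one character, so it cannot clash with a letter


def build_trie(phrases, points):
    """Character trie as nested dicts; a node with END closes a phrase."""
    root = {}
    for phrase, value in zip(phrases, points):
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[END] = value
    return root


def score(text, root):
    """Walk the trie from every start position and add the points of each hit."""
    total = 0
    for start in range(len(text)):
        node = root
        pos = start
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if END in node:
                total += node[END]
    return total


trie = build_trie(["ab", "cd", "abcd"], [3, 4, 1])
print(score("abcd", trie))  # 8: "ab" (3) + "abcd" (1) + "cd" (4)
```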

Answer 4

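(I think this is similar to thebenman's answer.) Depending on the kind of overlaps in alist, you could convert alist into a dictionary (or nested dictionaries, i.e. a tree):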

{
  "I": [(None, 1)],
  "am": [("a", 2)],
  "list": [(None, 3)],
  "of": [("strings", 4)],
  "and": [("each", 0), ("string", 5)],
  "could": [("be", 0), ("made", 0), ..., ("words", 6)],
  "so": [("think", 0), ("of", 0), ..., ("here", 7)]
}

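Now we can walk through astr word by word, without indexing it, keeping references to all currently open partial matches and updating them as we go.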
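One possible way to write that traversal, as a sketch only. It uses nested dicts keyed by word rather than the tuple lists shown above, and the marker POINTS plus the function names are assumptions for illustration. Words are split on whitespace, so punctuation stays attached ("text," is not "text").

```python
POINTS = None  # hypothetical marker key; words are strings, so None cannot clash with one


def build_word_tree(phrases, points):
    """Nested dicts keyed by word; a node holding POINTS closes a phrase."""
    root = {}
    for phrase, value in zip(phrases, points):
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[POINTS] = value
    return root


def score_words(text, root):
    """Single pass over the words, carrying all partial matches still open."""
    total = 0
    open_matches = []
    for word in text.split():
        still_open = []
        # Each word may extend an open match or start a new one at the root.
        for node in open_matches + [root]:
            child = node.get(word)
            if child is not None:
                if POINTS in child:
                    total += child[POINTS]
                still_open.append(child)
        open_matches = still_open
    return total


astr = "I am a very long string and I could contain a lot of text, so think of efficiency here"
alist = ["I", "am a", "list", "of strings", "and each string",
         "could be made up of many words", "so think of efficiency here"]
print(score_words(astr, build_word_tree(alist, [1, 2, 3, 4, 5, 6, 7])))  # 11
```

The number of open matches at any point is bounded by the number of words in the longest phrase, so the pass stays close to linear in the number of words of astr.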

Answer 5

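You could also generate all possible subsequences (contiguous runs of words), run a Counter over them, and then each lookup is almost O(1).

Building the dictionary (or index) takes more memory, but it is more efficient if you need many lookups against the same long string.

Something like this: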

from collections import Counter


def get_all_counts(input_string):
    """Count every contiguous run of words in input_string."""
    cnt = Counter()
    words = input_string.split()
    for i in range(len(words)):
        for j in range(i, len(words)):
            # I've put 1 here, but you could easily replace it with a lookup of your "points"
            cnt[' '.join(words[i:j + 1])] += 1
    return cnt


counts = get_all_counts(
    'I am a very long string and I could contain a lot of text, so think of efficiency here')

print(counts['am'])
print(counts['of'])

It might be nicer with itertools, but you should get the idea.
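As a sketch of the "lookup of your points" step, here is one way to turn such counts into the total score. The names count_word_runs and total_points are made up, and the cap on run length (no run longer than the longest phrase) is an addition of this sketch to keep the Counter small. It assumes phrases is not empty.

```python
from collections import Counter


def count_word_runs(text, max_words):
    """Count every contiguous run of at most max_words words."""
    words = text.split()
    counts = Counter()
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_words, len(words)) + 1):
            counts[" ".join(words[i:j])] += 1
    return counts


def total_points(text, phrases, points):
    # Runs longer than the longest phrase can never match, so skip them.
    longest = max(len(p.split()) for p in phrases)
    counts = count_word_runs(text, longest)
    # A missing key in a Counter counts as 0, so absent phrases add nothing.
    return sum(counts[p] * v for p, v in zip(phrases, points))


astr = "I am a very long string and I could contain a lot of text, so think of efficiency here"
alist = ["I", "am a", "list", "of strings", "and each string",
         "could be made up of many words", "so think of efficiency here"]
print(total_points(astr, alist, [1, 2, 3, 4, 5, 6, 7]))  # 11
```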

Another benefit is that you can turn it into a pandas DataFrame and query it.

For example:

import pandas as pd

df = pd.DataFrame.from_dict(counts, orient='index').reset_index()

print(df[df[0] > 1])

would give you all substrings that occur more than once.

Efficiently check for words in a string from a list of strings

5 answers: