Question

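Scenario: I have performed some task for a given "Section Header" (stored as a string), and the result of that task has to be saved against the corresponding "Existing Section Header" (also stored as a string).

While mapping, if the task's "Section Header" is one of the "Existing Section Headers", the task result is added to it. If not, the new Section Header is appended to the Existing Section Header list.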

The existing section headers look like this:

["Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]

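For the following set of strings, the expected behaviour is:

"Activity (Last 30 Days)" - a new section should be added.

"Executables running from disk" - the existing "Executable running from disk" should be used [treating "Executables", with its extra "s", as the same as "Executable"].

"Actions from a file" - the existing "Actions from File" should be used [ignoring the extra article "a"].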

Is there any built-in function in Python that could help implement this logic? Any suggestions for an algorithm would also be highly appreciated.

Answer 1

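You may find regular expressions helpful in this situation. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces each one with the specified string.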

import re  # provides regular-expression support

def modifyHeader(header):
    # change the number of days to 30; \( and \) match the literal parentheses
    modifiedHeader = re.sub(r"Activity \(Last \d+ Days?\)", "Activity (Last 30 Days)", header)
    # add an "s" to "Executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    # add the article "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)

    return modifiedHeader

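The r"" prefix denotes raw strings, which make the \ characters that regular expressions need a bit easier to handle. \d matches any digit character, and + means "1 or more". Read the page I linked above for more information.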
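As a quick illustration of the raw-string and \d+ points, here is a minimal sketch. The sample headers come from the question; the names `pattern`, `headers` and `normalized` are only illustrative, and the re.IGNORECASE flag is an assumption added so that "days" and "Days" are treated alike.

```python
import re

# Raw string: backslashes reach the regex engine unchanged.
# \( and \) match literal parentheses; \d+ matches one or more digits.
pattern = re.compile(r"Activity \(Last \d+ Days?\)", re.IGNORECASE)

headers = ["Activity (Last 3 Days)", "Activity (Last 7 days)", "Actions from File"]

# Rewrite every "Activity (Last N Days)" header; other headers pass through untouched.
normalized = [pattern.sub("Activity (Last 30 Days)", h) for h in headers]
print(normalized)
# ['Activity (Last 30 Days)', 'Activity (Last 30 Days)', 'Actions from File']
```

Note that this rewrites headers rather than comparing them, so it only helps when the set of variations is known in advance.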

Answer 2

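Since you only want to compare the stem, or "root word", of each word, I suggest using a stemming algorithm. Stemming algorithms try to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily, there is a Python package for stemming. You can download it from here.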
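To make the idea of suffix stripping concrete, here is a deliberately naive sketch. `toy_stem` is a hypothetical helper invented for illustration; it is not the Porter algorithm and not part of any package, and real stemmers apply many more rules.

```python
def toy_stem(word):
    """Naive illustration only: lowercase the word and strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words such as "is" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("Executables") == toy_stem("Executable"))  # True
print(toy_stem("running"))                                # runn
```

The stem does not have to be a real word ("runn"); it only has to be the same for every variant of a word.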

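Next, you want to compare the strings without stop words (a, an, the, from, etc.), so you need to filter these words out before comparing the strings. You can either get a list of stop words from the internet or import the stop-word list from the nltk package. You can get nltk from here.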

If you have any problems with nltk, here is the list of stop words:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
 'should', 'now']
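The filtering step on its own can be sketched without nltk. This is a minimal example that assumes a small hand-picked subset of the list above; `STOP_WORDS` and `content_words` are hypothetical names used only here.

```python
# A small subset of the stop-word list above, kept short for illustration.
STOP_WORDS = {"a", "an", "the", "from", "in", "on", "of"}

def content_words(header):
    """Return the lowercased words of `header` that are not stop words."""
    return [w.lower() for w in header.split() if w.lower() not in STOP_WORDS]

print(content_words("Actions from a file"))  # ['actions', 'file']
print(content_words("Actions from File"))    # ['actions', 'file']
```

Both headers reduce to the same word list, which is why the extra article "a" no longer matters.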

Now use this simple code to get the desired output:

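from stemming.porter2 import stem
from nltk.corpus import stopwords

# English stop words (requires the NLTK "stopwords" corpus to be downloaded)
stopwords_ = stopwords.words('english')

def addString(x):
   # `section` is the global list of existing section headers
   flag = True
   # stem every word of the new header, dropping stop words
   y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
   for i in section:
      i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
      if y == i:
         # an equivalent header already exists
         flag = False
         break
   if flag:
      section.append(x)
      print("\tNew Section Added")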

Demo:

>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ =  stopwords.words('english')
>>> 
>>> def addString(x):
...    flag = True
...    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...    for i in section:
...       i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...       if y==i:
...          flag = False
...          break
...    if flag:
...       section.append(x)
...       print("\tNew Section Added")
... 
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"]  # initial Section list
>>> addString("Activity (Last 30 Days)")
    New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)']  # Final section list
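The question also needs to know which existing header a new one maps to, so that the task result can be stored against it. A minimal sketch of that lookup follows. `find_or_add` and `simple_key` are hypothetical names, and `simple_key` is only a crude stand-in (drop a few articles, strip trailing "s") for the stemming and stop-word step shown above; in practice you would pass in a normalizer built on a real stemmer.

```python
def simple_key(header):
    """Crude stand-in normalizer: lowercase, drop articles, strip trailing 's'."""
    articles = {"a", "an", "the"}
    return tuple(w.lower().rstrip("s") for w in header.split() if w.lower() not in articles)

def find_or_add(header, sections, normalize=simple_key):
    """Return the existing header equivalent to `header`; append `header` if none matches."""
    key = normalize(header)
    for existing in sections:
        if normalize(existing) == key:
            return existing
    sections.append(header)
    return header

sections = ["Activity (Last 3 Days)", "Activity (Last 7 days)",
            "Executable running from disk", "Actions from File"]

print(find_or_add("Executables running from disk", sections))  # Executable running from disk
print(find_or_add("Actions from a file", sections))            # Actions from File
print(find_or_add("Activity (Last 30 Days)", sections))        # Activity (Last 30 Days)
print(len(sections))                                           # 5
```

Returning the canonical header, rather than only printing a message, lets the caller use it directly as the key under which the result is saved.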

Python partial string comparison: solution needed for comparing strings
