Question

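I'm trying to write a Python script to find word collocations in text. A collocation is a pair of words that frequently co-occur across texts. For example, in the collocation "lemon zest", the words lemon and zest frequently co-occur, so it is a collocation. Now I want to use re.findall to find all occurrences of a given collocation. Unlike "lemon zest", some collocations are not adjacent in the text. For example, in the phrase "kind of funny", "of" is a stop word, so it has already been removed. So given the collocation "kind funny", the program must return "kind of funny" as output. Can anyone tell me how to do this? I should mention that I need a scalable approach, because I am dealing with gigabytes of text.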

EDIT1:

inputCollocation = "kind funny"
Document1 = "This film is kind of funny"
Document2 = "It is kind of funny"
Document3 = "That film is funny"


ExpectedOutput: Document1, Document2

Thanks in advance.

Answer 1

You can use string comparison:

inputCollocation = "kind funny"
documents = dict(
    Document1 = "This film kind funny",
    Document2 = "It kind funny",
    Document3 = "That film funny",
)

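def remove_stopwords(text):
    # Stop-word removal is left unspecified here; it must return a string.
    ...

# Keep the names of documents whose stop-word-free text contains the collocation.
matching = [
    document for (document, text) in documents.items()
    if inputCollocation in remove_stopwords(text.lower())
]
print('ExpectedOutput:', ', '.join(matching))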
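
To make the idea above concrete, here is a minimal runnable sketch that fills in the stop-word step. The STOPWORDS set and the find_matching helper are hypothetical names introduced only for illustration, and the word list is a tiny made-up one; substitute whatever stop-word list you already use. It also pads both strings with spaces so that "kind funny" does not match inside "mankind funny".

```python
import re

# Hypothetical stop-word list, for illustration only; use your real list.
STOPWORDS = {"is", "of", "a", "the", "it", "this", "that"}

def remove_stopwords(text):
    """Return text lowercased, with stop words dropped and single spaces."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(w for w in words if w not in STOPWORDS)

def find_matching(collocation, documents):
    """Return names of documents whose filtered text contains the collocation."""
    # Pad with spaces so the match only starts and ends on word boundaries.
    needle = " " + collocation + " "
    return [
        name for name, text in documents.items()
        if needle in " " + remove_stopwords(text) + " "
    ]

documents = {
    "Document1": "This film is kind of funny",
    "Document2": "It is kind of funny",
    "Document3": "That film is funny",
}
print(find_matching("kind funny", documents))
```

With the three documents from the question, this should select Document1 and Document2. For gigabytes of text you would probably feed it one document (or one line) at a time from disk rather than hold everything in a dict.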

You might also consider using NLTK, which includes tools for finding collocations.
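
If you would rather stay with re.findall on the raw text, as the question asks, one possible sketch is to build a pattern that tolerates a few stop words between the collocation words. The collocation_pattern helper, the STOPWORDS list, and the max_gap parameter are all assumptions made for this example and are not part of any library.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOPWORDS = ["of", "a", "the", "is"]

def collocation_pattern(collocation, stopwords=STOPWORDS, max_gap=2):
    """Compile a regex matching the collocation words in order, allowing
    up to max_gap stop words between each adjacent pair of words."""
    stop = "|".join(re.escape(s) for s in stopwords)
    # Whitespace, then zero to max_gap occurrences of "<stop word><whitespace>".
    gap = r"\s+(?:(?:%s)\s+){0,%d}" % (stop, max_gap)
    words = [re.escape(w) for w in collocation.split()]
    return re.compile(r"\b" + gap.join(words) + r"\b", re.IGNORECASE)

pattern = collocation_pattern("kind funny")
print(pattern.findall("This film is kind of funny"))
print(pattern.findall("That film is funny"))
```

Because the pattern has no capturing groups, findall returns the full matched phrases, so the first call should give the "kind of funny" span and the second an empty list. Compiling the pattern once and applying it line by line while streaming the files keeps memory use flat on large inputs.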
