Question

I have another question, and since this community has been such a great help to me, I figured I would give it another go.

Right now I have Python 3 code that imports a CSV file whose first column is filled with words in the following format:

The
Words
Look
Like
This
In
A
Column

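After Python loads and reads this CSV file, the words are tagged with the NLTK POS tagger. From there, all the words are permuted, and the results are exported to a new CSV file. Right now, my full code looks like this: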

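import csv
import itertools

import nltk

# Read every cell of the CSV into one flat list of words
with open(r'C:\Users\jkk\Desktop\python.csv', 'r') as f:
    reader = csv.reader(f)
    J = []
    for row in reader:
        J.extend(row)

# Tag each word, then build every ordered triple of (word, tag) pairs
D = nltk.pos_tag(J)
C = list(itertools.permutations(D, 3))

with open('test.csv', 'w') as a_file:
    for result in C:
        # each item is a (word, tag) tuple, so format it as word/tag before joining
        line = ' '.join('/'.join(pair) for pair in result)
        a_file.write(line + '\n')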

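My question is: how can I define rules for the word permutations based on the words' tags? More specifically, the reason I tag the words is that I don't want nonsensical permutations (e.g. "This In" / "A This In" / etc.). Once the words are tagged with their respective parts of speech, how can I code rules based on those tags, for example: never let two "DT"-tagged words follow each other (i.e. "The" and "A"), or always have an NN-tagged word followed by a VBG-tagged word (i.e. "Look" always comes after "Words")? And finally, once these rules are in place, how do I get rid of the tags and keep only the original words?

I realize this is a broad question, but any guidance on how to approach it would be much appreciated, since I am still very new and learning at every step. Any resources, code, or even suggestions would help. Thanks again for taking the time to read this long post!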

Answer 1

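A set of rules that defines the legal strings of a language is called a grammar (or formal grammar). There are many formalisms that let you express such rules. A fairly simple one to experiment with is the context-free grammar (CFG). NLTK comes with tools for generating strings from these grammars. The NLTK book's chapter on syntax goes into more depth.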

The following code works with Python 3 and NLTK 3.0a4. The API changed between NLTK 2 and 3, so it will not run on older versions.

# In later NLTK 3 releases, parse_cfg(cfg) was replaced by CFG.fromstring(cfg)
from nltk import parse_cfg, pos_tag
from nltk.parse.generate import generate
from nltk.util import trigrams

# build a simple grammar
cfg = """
S -> NP VP
VP -> VBZ NP
NP -> DT | NN | DT NN | DT JJ NN | JJ NN
"""

# you get these from your csv
words = "this is a simple sentence".split()
# tag the words; the set drops duplicate (word, tag) pairs
tagged = set(pos_tag(words))
# add each word to the grammar as a terminal rule for its tag
for word, tag in tagged:
    cfg += "{tag} -> '{word}'\n".format(word=word, tag=tag)
grammar = parse_cfg(cfg)

valid_trigrams = set()

language = generate(grammar)
for valid_sentence in language:
    valid_trigrams.update(list(trigrams(valid_sentence)))

print(valid_trigrams)
# {('simple', 'sentence', 'is'), ('this', 'is', 'this'), ('a', 'sentence', 'is'), ('sentence', 'is', 'a'), ('a', 'is', 'a'), ('this', 'is', 'simple'), ('sentence', 'is', 'this'), ('this', 'is', 'sentence'), ('is', 'a', 'sentence'), ('is', 'a', 'simple'), ('a', 'simple', 'sentence'), ('a', 'is', 'this'), ('this', 'simple', 'sentence'), ('this', 'is', 'a'), ('is', 'simple', 'sentence'), ('a', 'is', 'simple'), ('this', 'sentence', 'is'), ('is', 'this', 'sentence'), ('sentence', 'is', 'sentence'), ('sentence', 'is', 'simple'), ('is', 'this', 'simple'), ('a', 'is', 'sentence')}
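
To finish the workflow from the question (only the original words, exported to a new CSV), the trigrams can be joined and written out one per row. This is a minimal sketch, not part of the original answer: the small valid_trigrams set is a hypothetical sample standing in for the one built above, and write_trigrams is a helper name made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the valid_trigrams set built above
valid_trigrams = {
    ("this", "is", "a"),
    ("is", "a", "sentence"),
    ("a", "simple", "sentence"),
}

def write_trigrams(trigram_set, out_file):
    """Write one space-joined trigram per row; sorting keeps the output stable."""
    writer = csv.writer(out_file)
    for trigram in sorted(trigram_set):
        writer.writerow([" ".join(trigram)])

# An in-memory buffer keeps the sketch free of side effects
buffer = io.StringIO()
write_trigrams(valid_trigrams, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open('test.csv', 'w', newline='') instead of the buffer. The trigrams already hold plain words, so there are no tags left to strip.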

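This approach has its limits, though, because a context-free grammar cannot cover all of English. Then again, there is no known way to fully validate English grammar, so you will only ever have an approximate solution.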
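
If a full grammar is more than you need, the adjacency rules from the question can also be checked directly on the tagged permutations. The sketch below is one possible way to do it, under stated assumptions: the tagged list is hard-coded so it runs without NLTK (in practice it would come from nltk.pos_tag), the verb rule accepts any tag starting with "VB" rather than only VBG, and all function names are hypothetical.

```python
from itertools import permutations

# Hypothetical tagged input; in practice this would come from nltk.pos_tag(J)
tagged = [("The", "DT"), ("Words", "NNS"), ("Look", "VBP"), ("A", "DT")]

def no_adjacent_determiners(tags):
    """Rule: two DT-tagged words may never sit next to each other."""
    return all(not (a == "DT" and b == "DT") for a, b in zip(tags, tags[1:]))

def noun_followed_by_verb(tags):
    """Rule: a noun that is not in last position must be followed by a verb."""
    return all(b.startswith("VB")
               for a, b in zip(tags, tags[1:]) if a.startswith("NN"))

RULES = [no_adjacent_determiners, noun_followed_by_verb]

def valid_phrases(tagged_words, length=3):
    """Yield tag-free phrases for every permutation that passes all rules."""
    for combo in permutations(tagged_words, length):
        tags = [tag for _, tag in combo]
        if all(rule(tags) for rule in RULES):
            # drop the tags and keep only the original words
            yield " ".join(word for word, _ in combo)

phrases = list(valid_phrases(tagged))
print(phrases)
```

Each rule is a plain function over the tag sequence, so adding a rule only means appending another function to RULES. Rules this simple still let odd phrases through, which is the same approximation trade-off as with the grammar.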

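Another thing you should be aware of is that the POS tagger assumes that word order is meaningful. Roughly speaking, it gives each word a set of possible tags and then refines them based on the preceding and following words. So if your CSV contains sentences, you are fine; otherwise, you may want to use a unigram POS tagger instead, although in either case you will only get the most common tag. That is a problem for a word like "run", which can be a verb or a noun ("a morning run" vs. "I run").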
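
To make the "most common tag" point concrete, here is a pure-Python sketch of the unigram idea. It is not NLTK's UnigramTagger class, only an illustration of the same principle; the tiny hand-tagged corpus, the function names, and the "NN" default for unknown words are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only)
tagged_corpus = [
    ("I", "PRP"), ("run", "VBP"), ("daily", "RB"),
    ("they", "PRP"), ("run", "VBP"), ("fast", "RB"),
    ("a", "DT"), ("morning", "NN"), ("run", "NN"),
]

def train_unigram(tagged_words):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word on its own, ignoring its neighbours entirely."""
    return [(word, model.get(word.lower(), default)) for word in words]

model = train_unigram(tagged_corpus)
print(tag_words(["a", "morning", "run"], model))
```

In this toy corpus "run" appears twice as a verb and once as a noun, so it is tagged VBP even in "a morning run", where it is a noun.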

Creating word permutation rules based on the words' POS tags
