Question

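I previously built a text classification model using NLTK. Now I need to build a similar model, but this time I have to use multi-word tokens instead of single-word tokens. In addition, my input data consists of features that were already extracted from historical data, plus the multi-word tokens already extracted from the documents to be classified. I do not have the raw text.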

I plan to build a document-phrase matrix and then train a classifier on it.

PS: NLTK provides n-gram tokenization, but I don't have the source text here. I am only given the multi-word tokens.

Input:

features = ['food security','indigenous groups','national forest','wood forest']

Doc1 = ['food security','indigenous groups','wood forest']

Doc2 = ['national forest','wood forest']

Doc3 = ['tree products', 'forest resources']

Doc4 = ['local population']

Doc5 = []

Output:

+----------+---------------+-------------------+-----------------+-------------+
| Features | food security | indigenous groups | national forest | wood forest |
+----------+---------------+-------------------+-----------------+-------------+
| Doc1     |             1 |                 1 |               0 |           1 |
| Doc2     |             0 |                 0 |               1 |           1 |
| Doc3     |             0 |                 0 |               0 |           0 |
| Doc4     |             0 |                 0 |               0 |           0 |
| Doc5     |             0 |                 0 |               0 |           0 |
+----------+---------------+-------------------+-----------------+-------------+
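To pin down the mapping from the input to the table above, here is a minimal sketch in plain Python. It assumes binary presence/absence (not counts) and that tokens outside `features` are simply ignored, which is what the Doc3 and Doc4 rows suggest. The function name `document_phrase_matrix` and the `docs` dictionary are made up for illustration.

```python
# Fixed vocabulary of multi-word features, as given in the input.
features = ['food security', 'indigenous groups', 'national forest', 'wood forest']

# Pre-tokenized documents: each one is already a list of multi-word tokens.
docs = {
    'Doc1': ['food security', 'indigenous groups', 'wood forest'],
    'Doc2': ['national forest', 'wood forest'],
    'Doc3': ['tree products', 'forest resources'],
    'Doc4': ['local population'],
    'Doc5': [],
}

def document_phrase_matrix(docs, features):
    """Return {doc_name: row}, where row[i] is 1 if features[i] occurs in the doc.

    Tokens that are not in `features` are ignored (assumption).
    """
    matrix = {}
    for name, tokens in docs.items():
        present = set(tokens)  # whole phrases are compared, never split on spaces
        matrix[name] = [1 if f in present else 0 for f in features]
    return matrix

matrix = document_phrase_matrix(docs, features)
for name, row in matrix.items():
    print(name, row)
```

Because each phrase is compared as a whole string, no tokenizer is involved and the spaces inside a phrase never matter.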

From there, I plan to build a classifier.

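My current plan is to convert these multi-word tokens/features into single words by replacing ' ' with '_', so that 'food security' becomes 'food_security'. But I know this is very bad practice. I am looking for a better approach.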
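For reference, the underscore workaround amounts to roughly the following sketch. The helper name `join_phrase` and the variable names are hypothetical. It only illustrates the idea of making each phrase look like one word to a whitespace-based tokenizer.

```python
def join_phrase(phrase):
    """Replace spaces with underscores so the phrase looks like a single word."""
    return phrase.replace(' ', '_')

doc1 = ['food security', 'indigenous groups', 'wood forest']

# Rebuild a pseudo-document that a single-word tokenizer can split on whitespace.
pseudo_text = ' '.join(join_phrase(p) for p in doc1)
print(pseudo_text)  # food_security indigenous_groups wood_forest

# Splitting on whitespace now yields one token per original phrase.
tokens = pseudo_text.split()
```

One weakness of this route is that it cannot be reversed reliably: if a phrase already contains an underscore, the joined form no longer reveals where the original spaces were.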

How can I create a document-term matrix from multi-word terms?

0 answers: