Question

I have a text dataset. It consists of many lines, and each line contains two sentences separated by a tab, like this:

this is string 1, first sentence.    this is string 2, first sentence.
this is string 1, second sentence.    this is string 2, second sentence.

I then split the text data with the following code:

#file readdata.py
from globalvariable import *
import os


class readdata:
    def dataAyat(self):
        global kalimatayat
        fo = open(os.path.join('E:\dataset','dataset.txt'),"r")
        line = []
        for line in fo.readlines():
            datatxt = line.rstrip('\n').split('\t')
            newdatatxt = [x.split('\t') for x in datatxt]
            kalimatayat.append(newdatatxt)
            print newdatatxt

readdata().dataAyat()

It works, and the output is:

[['this is string 1, first sentence.'],['this is string 2, first sentence.']]
[['this is string 1, second sentence.'],['this is string 2, second sentence.']]

What I want to do is tokenize these lists with nltk's word_tokenize. The output I expect looks like this:

[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']]
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]

Does anyone know how to tokenize the data into the output above? I would also like to write the tokenize function in "tokenizer.py" and call it from "mainfile.py".

Answer 1

To tokenize the list of sentences, iterate over it and store the results in a list:

import nltk

data = [[['this is string 1, first sentence.'],['this is string 2, first sentence.']],
[['this is string 1, second sentence.'],['this is string 2, second sentence.']]]
results = []
for sentence in data:
    sentence_results = []
    for s in sentence:
        # s is a one-element list, so pass the string s[0] to the tokenizer
        sentence_results.append(nltk.word_tokenize(s[0]))
    results.append(sentence_results)

The result will look like this:

[[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],  
  ['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']], 
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],
  ['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]]
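
Since the question also asks for a reusable function in "tokenizer.py", here is a minimal sketch of how that could look. The function name tokenize_rows and its optional tokenize parameter are my own assumptions, not anything nltk prescribes. It expects the nested shape that readdata.py produces (each sentence wrapped in a one-element list), and it falls back to nltk.word_tokenize, which in turn needs NLTK's Punkt tokenizer data to be installed.

```python
# tokenizer.py (module name taken from the question; the function name is hypothetical)

def tokenize_rows(rows, tokenize=None):
    """Tokenize rows shaped like [[['sentence 1'], ['sentence 2']], ...].

    `tokenize` is any callable mapping a string to a list of tokens.
    When it is omitted, nltk.word_tokenize is used.
    """
    if tokenize is None:
        # Imported lazily so the module can be loaded without nltk
        # when a custom tokenizer is supplied.
        import nltk
        tokenize = nltk.word_tokenize
    # Each cell is a one-element list, so cell[0] is the sentence string.
    return [[tokenize(cell[0]) for cell in row] for row in rows]


# Possible use from mainfile.py, assuming kalimatayat has already been
# filled by readdata().dataAyat():
#
#     from tokenizer import tokenize_rows
#     from globalvariable import kalimatayat
#     tokens = tokenize_rows(kalimatayat)
```

Keeping the tokenizer as a parameter is only a convenience: it lets you swap in something like str.split for a quick check without touching the rest of the code.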
