How to convert words into sentence strings - text classification

Time: 2017-04-13 02:52:06

Tags: python nltk text-mining text-classification corpus

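I'm currently working with the Brown corpus and have run into a small problem. Before I can apply tokenization features, I first need the Brown corpus as sentence strings. This is what I have so far: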

from nltk.corpus import brown
import nltk


target_text = [s for s in brown.fileids()
                   if s.startswith('ca01') or s.startswith('ca02')]

data = []

total_text = [s for s in brown.fileids()
                   if s.startswith('ca01') or s.startswith('ca02') or s.startswith('cp01') or s.startswith('cp02')]


for text in total_text:

    if text in target_text:
        tag = "pos"
    else:
        tag = "neg"
    # Read only the current file; passing total_text would re-read every file on each pass
    words = list(brown.sents(text))
    data.extend([(tag, word) for word in words])

data

When I run this, the data I get looks like this:

[('pos',
  ['The',
   'Fulton',
   'County',
   'Grand',
   'Jury',
   'said',
   'Friday',
   'an',
   'investigation',
   'of',
   "Atlanta's",
   'recent',
   'primary',
   'election',
   'produced',
   '``',
   'no',
   'evidence',
   "''",
   'that',
   'any',
   'irregularities',
   'took',
   'place',
   '.']),
 ('pos',
  ['The',
   'jury',
   'further',
   'said',
   'in',
   'term-end',
   'presentments',
   'that',
   'the',
   'City',
   'Executive',
   'Committee',
   ',',
   'which',
   'had',
   'over-all',
   'charge',
   'of',
   'the',
   'election',
   ',',
   '``',
   'deserves',
   'the',
   'praise',
   'and',
   'thanks',
   'of',
   'the',
   'City',
   'of',
   'Atlanta',
   "''",
   'for',
   'the',
   'manner',
   'in',
   'which',
   'the',
   'election',
   'was',
   'conducted',
   '.'])

What I need is something that looks like this:

[('pos', "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ...."), ('pos', 'The jury further said in term-end presentments that the City...')]

Is there any way to solve this? This project is taking longer than I expected.

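1 answer:

Answer 0: (score: 1)

According to the docs, the .sents method returns a list (the document) of lists (the sentences) of strings (the words), so there is nothing wrong with your call.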
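The nesting described above can be sketched without loading the corpus. The `sents` list below is a hand-written, abbreviated stand-in for what `brown.sents` returns (hypothetical sample data, not read from the corpus), just to show why each tuple ends up holding a list of words rather than a string:

```python
# Hand-written stand-in for the return value of brown.sents(fileid):
# a list of sentences, each sentence being a list of word strings.
# (Abbreviated sample data for illustration, not read from the corpus.)
sents = [
    ['The', 'jury', 'further', 'said', '.'],
    ['The', 'election', 'was', 'conducted', '.'],
]

tag = "pos"
data = [(tag, sentence) for sentence in sents]

# The second element of each tuple is still a list of tokens, not a string.
print(type(data[0][1]).__name__)
print(data[0][1])
```

So the tuples in the question hold token lists because that is the shape of the data, not because the loop is doing something wrong.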

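If you want to reconstitute the sentences, you can try joining the words with spaces. Because of the punctuation, though, this won't quite work: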

data.extend( [(tag, ' '.join(word)) for word in words] )

A token sequence like this:

'the',
'election',
',',
'``',
'deserves',
'the',

maps to:

the election , `` deserves the

because join knows nothing about punctuation. Does nltk include some kind of punctuation formatter?
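One possible workaround is to tidy up the spacing after joining. The sketch below is only a heuristic built on the standard `re` module; `join_tokens` is a hypothetical helper name, not part of nltk, and it assumes Brown-style tokens in which `` opens a quotation and '' closes it. More recent NLTK releases also ship `nltk.tokenize.treebank.TreebankWordDetokenizer`, which may be worth trying if it is available in your version.

```python
import re


def join_tokens(tokens):
    """Join word tokens into one string and tidy the spacing around punctuation.

    Heuristic sketch only: handles Brown-style quotes, common closing
    punctuation, and opening brackets. It is not a full detokenizer.
    """
    text = " ".join(tokens)
    # `` opens a quotation: replace it and drop the space that follows.
    text = re.sub(r"``\s*", '"', text)
    # '' closes a quotation: replace it and drop the space that precedes.
    text = re.sub(r"\s*''", '"', text)
    # Remove the space before closing punctuation such as , . ; : ! ? ) ]
    text = re.sub(r"\s+([.,;:!?)\]])", r"\1", text)
    # Remove the space after opening brackets.
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text


tokens = ['the', 'election', ',', '``', 'deserves', 'the', 'praise', "''", '.']
print(join_tokens(tokens))
# prints: the election, "deserves the praise".
```

In the loop from the question, this would replace the plain join, for example `data.extend([(tag, join_tokens(word)) for word in words])`. Apostrophes inside a single token, as in "Atlanta's", are left untouched because the quote rules only match the doubled characters.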