Tokenised list comprehensions

Asked: 2018-11-24 14:26:21

Tags: python python-3.x token list-comprehension

I wrote this code to measure, on a large corpus sample, how much the vocabulary size shrinks when number and case normalisation are applied at the same time.

import re
from nltk import word_tokenize  # assuming NLTK's tokenizer

def vocabulary_size(sentences):
    # Count every distinct token across all sentences; the vocabulary
    # size is the number of distinct keys.
    tok_counts = {}
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts)

rcr = ReutersCorpusReader()    

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))

As it stands, though, it just prints every individual character. I think I've narrowed the problem down to the two marked lines. A list has no .lower() attribute, so I'm not sure what to replace it with.

I also think I might have to feed lowered_sentences into my normalised_sentences.

Here is my normalise function:

def normalise(token):
    print(["NUM" if token.isdigit() 
    else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token) 
    else token for token in token])  

Although I may not even end up using this particular normalise function. Apologies if I'm attacking this the wrong way; I'll add more information as needed.

2 Answers:

Answer 0 (score: 3)

I see a few things that should sort this out for you.

 lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
 normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

Here you have forgotten to actually use the loop variable; you probably meant:

 lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
 normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]

Also, since a list has no lower() function, you have to apply it to every token in each sentence, i.e.

 lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]

Furthermore, your normalise(token) only prints and never returns anything. So the list comprehension

 normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

produces nothing but a list of Nones.
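
A minimal sketch of a version that returns its result instead of printing it might look like this (keeping the NUM/Nth rules from your question, and renaming the parameter to sentence so it no longer shadows the loop variable):

 import re

 def normalise(sentence):
     # Build and return the normalised token list rather than printing it
     return ["NUM" if token.isdigit()
             else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token)
             else token
             for token in sentence]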

I suggest you avoid list comprehensions to begin with and use ordinary for loops until the algorithm works; convert them later if you need the speed.
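
For instance, the lowering step rewritten with plain loops might look like this (a sketch equivalent to the nested comprehension above):

 lowered_sentences = []
 for sentence in tokenised_sentences:
     lowered = []
     for token in sentence:
         lowered.append(token.lower())
     lowered_sentences.append(lowered)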

Answer 1 (score: 2)

You appear to be using the wrong variables in your comprehensions:

# Wrong
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences]

# Right
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]

However, if you want to normalise the lowercased sentences, we need to change that line as well:

# Right
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]
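
Note that sentence.lower() will still fail at this point, because each tokenised sentence is a list of tokens rather than a string. A sketch of the combined fix, lowering per token and then normalising the lowered sentences (assuming a normalise that returns its result, as in the other answer):

lowered_sentences = [[token.lower() for token in sentence]
                     for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]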