Splitting a string into chunks in Python

Date: 2016-02-29 10:57:39

Tags: python

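I wrote code that extracts words from a corpus, tokenizes them, and compares them against sentences. The output is a bag of words (1 if the word occurs in the sentence, 0 otherwise).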

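import nltk
import numpy as np
from nltk import FreqDist
from nltk.corpus import brown

# Words and sentences from the "news" category of the Brown corpus
news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

# Vocabulary: the 100 most frequent lowercased words
fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(100)]
num_sents = len(news_sents)

# Open the output file once instead of reopening it for every sentence
with open("D:\\test\\Vector.txt", "a") as f:
    for i in range(num_sents):
        # 1 if the vocabulary word occurs in sentence i, 0 otherwise
        features = {}
        for word in vocabulary:
            features[word] = int(word in news_sents[i])

        # Concatenate the 0/1 flags into one string per sentence
        bow = "".join(str(n) for n in features.values())
        print(bow, file=f)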

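In this case the output string is 100 characters long. I want to split it into chunks of arbitrary length and assign a chunk number to each one. For example: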

print(i+1, chunk_id, bow, sep="\t", end="\n", file=f)

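where i + 1 would be the sentence ID. To illustrate what I mean, take two strings of length 12, "110010101111" and "011011000011", split into chunks of 4. The output should look like: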

1 1 1100
1 2 1010
1 3 1111
2 1 0110
2 2 1100
2 3 0011

1 answer:

Answer 0 (score: 0)

The grouper recipe from the itertools documentation seems to be exactly what you are looking for:

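from itertools import zip_longest  # named izip_longest in Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    # n references to one iterator, so each tuple takes n consecutive items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)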
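
Applied to the question, a minimal sketch might look like the following. It assumes Python 3 (hence `zip_longest` rather than `izip_longest`), a chunk size of 4, and the two 12-character example strings from the question. `chunk_rows` and `chunk_size` are hypothetical names introduced only for illustration. Because `grouper` yields tuples of characters, each chunk is joined back into a string, and a short final chunk is padded with "0", which is an assumption the question does not specify.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks."""
    # n references to the same iterator: each tuple takes n consecutive items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def chunk_rows(bows, chunk_size, fillvalue="0"):
    """Yield (sentence_id, chunk_id, chunk) tuples with 1-based IDs.

    chunk_rows is a hypothetical helper, not part of itertools.
    A short final chunk is padded with fillvalue (an assumption).
    """
    for sent_id, bow in enumerate(bows, start=1):
        chunks = grouper(bow, chunk_size, fillvalue)
        for chunk_id, chars in enumerate(chunks, start=1):
            # grouper yields tuples of characters; join them into a string
            yield sent_id, chunk_id, "".join(chars)


if __name__ == "__main__":
    examples = ["110010101111", "011011000011"]
    for sent_id, chunk_id, chunk in chunk_rows(examples, chunk_size=4):
        print(sent_id, chunk_id, chunk, sep="\t")
```

In the loop from the question, the same idea would replace `print(bow, file=f)` with an inner loop over `enumerate(grouper(bow, chunk_size, "0"), start=1)` that prints `i + 1`, the chunk number, and the joined chunk. If padding is not wanted, slicing the string with `bow[k:k + chunk_size]` for `k` in `range(0, len(bow), chunk_size)` leaves the final chunk short instead.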