I wrote code that takes the words extracted from a corpus, tokenizes them, and compares them against each sentence. The output is a bag of words (1 if the word occurs in the sentence, 0 if it does not).
from nltk import FreqDist
from nltk.corpus import brown

news = brown.words(categories='news')
news_sents = brown.sents(categories='news')

fdist = FreqDist(w.lower() for w in news)
vocabulary = [word for word, _ in fdist.most_common(100)]

with open("D:\\test\\Vector.txt", "a") as f:
    for i, sent in enumerate(news_sents):
        # lowercase the sentence tokens so they can match the lowercased vocabulary
        sent_words = {w.lower() for w in sent}
        bow = "".join(str(int(word in sent_words)) for word in vocabulary)
        print(bow, file=f)
In this case the output string is 100 characters long. I want to split it into chunks of arbitrary length and assign each chunk a chunk number, e.g.:
print(i+1, chunk_id, bow, sep="\t", end="\n", file=f)
where i + 1 would be the sentence ID. To illustrate what I mean, take two strings of length 12: "110010101111" and "011011000011". The output should look like:
1 1 1100
1 2 1010
1 3 1111
2 1 0110
2 2 1100
2 3 0011
Answer (score: 0)
The grouper recipe from the itertools documentation seems to be exactly what you are looking for:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
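As a minimal sketch of how this could be wired into the question's loop (the `bows` list stands in for the bag-of-words strings, and the chunk size of 4 is the example width from the question):

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# The two 12-character example strings from the question.
bows = ["110010101111", "011011000011"]

lines = []
for i, bow in enumerate(bows):
    # start=1 so chunk numbering begins at 1, matching the desired output
    for chunk_id, chunk in enumerate(grouper(bow, 4, fillvalue="0"), start=1):
        # each chunk is a tuple of characters, so join it back into a string
        lines.append("{}\t{}\t{}".format(i + 1, chunk_id, "".join(chunk)))

print("\n".join(lines))
```

If the string length is not a multiple of the chunk size, `fillvalue="0"` pads the final chunk with zeros rather than dropping it.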