I want to merge a bunch of text files into two arrays - a 'word stream' and a 'document stream'. This is done by counting the total number of word tokens in the corpus and then creating the arrays, where each entry in the word stream corresponds to the word associated with that token, and the matching entry in the document stream corresponds to the document the word came from.
For example, if the corpus is
Doc1: "The cat sat on the mat"
Doc2: "The fox jumped over the dog"
the word stream (WS) and document stream (DS) would look like this:
WS: 1 2 3 4 1 5 1 6 7 8 1 9
DS: 1 1 1 1 1 1 2 2 2 2 2 2
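(In this example case is folded, so 'The' and 'the' share an index; the implied word indices are 1=the, 2=cat, 3=sat, 4=on, 5=mat, 6=fox, 7=jumped, 8=over, 9=dog.)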
I'm not sure how to do this, so my question is basically: how do I turn text files into an array of word tokens?
Answer (score 2):
Something like this? It's Python 3 code, but I think the only Python-3-specific part is the print statements. The comments contain some notes on possible future additions...
strings = [ 'The cat sat on the mat',       # documents to process
            'The fox jumped over the dog' ]

docstream = []   # document indices
wordstream = []  # token indices
words = []       # tokens themselves

# Return an array of words in the given string. NOTE: this splits on single
# spaces; in real life you might want to split on multiple spaces, newlines,
# tabs, what have you. See regular expressions in the module 're' and
# 're.split(...)'.
def tokenize(s):
    return s.split(' ')
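# A possible alternative (an assumption, not part of the original answer):
# split on any run of whitespace using the 're' module mentioned above, e.g.
#
#   import re
#   def tokenize(s):
#       return re.split(r'\s+', s.strip())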
# Look up a token in the 'words' list. If it is not present (yet), append it
# and return the new position. NOTE: in real life you might want to fold
# case so that 'The' and 'the' are treated the same.
def lookup_token(token):
    for i in range(len(words)):
        if words[i] == token:
            print('Found', token, 'at index', i)
            return i
    words.append(token)
    print('Appended', token, 'at index', len(words) - 1)
    return len(words) - 1
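# A hypothetical dict-based variant (not part of the original answer) that
# folds case so 'The' and 'the' share an index, and avoids scanning the
# whole list on every lookup:
#
#   word_index = {}                    # token -> position in 'words'
#   def lookup_token(token):
#       token = token.lower()          # fold case
#       if token not in word_index:
#           word_index[token] = len(words)
#           words.append(token)
#       return word_index[token]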
# Main starts here
for stringindex in range(len(strings)):
    print('Analyzing string:', strings[stringindex])
    tokens = tokenize(strings[stringindex])
    for t in tokens:
        print('Analyzing token', t, 'from string', stringindex)
        docstream.append(stringindex)
        wordstream.append(lookup_token(t))

# Done.
print(wordstream)
print(docstream)
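For the two sample strings, tracing this code (which does not fold case, so 'The' and 'the' get different indices) suggests the final two print calls would output:

    [0, 1, 2, 3, 4, 5, 0, 6, 7, 8, 4, 9]
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

The indices are zero-based here, unlike the 1-based example in the question; adding 1 everywhere (and folding case) would reproduce the original WS/DS numbering.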