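I have a text dataset that is a list of lists of lists of strings. I need to tokenize this data so that I can fit it to a classification model. I am very familiar with using keras.preprocessing.text.Tokenizer for this, and I usually use code like the following: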
from keras.preprocessing.text import Tokenizer

data = [[['not'],
['ahead'],
['um let me think'],
['thats not very encouraging if they had a cast of thousands on the other end']],
[['okay civil liberties tell me your position'],
['probably would go ahead']],
[['oh'],
['it up so i dont know where you really go'],
['well most of my problem with this latest task'],
['its some i kind of dont want to put in the time to do it'],
['right so im saying ive got a lot of other things to do']]]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
When I run this code on the data, I get the following error:
<ipython-input-44-1da804f42cc8> in main()
12 # tokenize and vectorize text data to prepare for embedding
13 tokenizer = Tokenizer()
---> 14 tokenizer.fit_on_texts(new_corpus)
15 sequences = tokenizer.texts_to_sequences(new_corpus)
16 word_index = tokenizer.word_index
/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
213 if self.lower:
214 if isinstance(text, list):
--> 215 text = [text_elem.lower() for text_elem in text]
216 else:
217 text = text.lower()
/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in <listcomp>(.0)
213 if self.lower:
214 if isinstance(text, list):
--> 215 text = [text_elem.lower() for text_elem in text]
216 else:
217 text = text.lower()
AttributeError: 'list' object has no attribute 'lower'
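This makes sense to me, because the Tokenizer expects each element to be a string (or a list of strings) but gets a list of lists instead. Normally I would flatten the list structure so that it passes through the Tokenizer.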
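To make the failure concrete, here is a minimal sketch that reproduces it without Keras. The helper `lower_like_tokenizer` is a hypothetical stand-in that only imitates the lowercasing branch visible in the traceback; it is not the library's actual code. The sketch also shows that flattening to a plain list of strings avoids the error, at the cost of losing the nesting.

```python
def lower_like_tokenizer(texts):
    """Hypothetical stand-in for the lowercasing step seen in the traceback."""
    out = []
    for text in texts:
        if isinstance(text, list):
            # Assumes every element of `text` is a string
            out.append([elem.lower() for elem in text])
        else:
            out.append(text.lower())
    return out


nested = [[['Not'], ['Ahead']], [['Oh']]]

# Each top-level element is a list of lists, so .lower() is called on a list
try:
    lower_like_tokenizer(nested)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Flattening yields plain strings, which the lowercasing step accepts
flat = [sentence for group in nested for utterance in group for sentence in utterance]
lowered = lower_like_tokenizer(flat)
```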
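However, I can't do that here, because the nested list structure is essential to my modeling.
So how can I tokenize the data while preserving the list structure? I want to treat the whole thing as my corpus and get a unique integer token for each word across all of the lists.
It should look something like this (I tokenized this by hand, so please excuse any typos):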
data = [[[0],
['1'],
['2, 3, 4, 5'],
['6, 0, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19']],
[['20, 21, 22, 23, 3, 24, 25'],
['26, 27, 28, 29']],
[['30'],
['31, 32, 33, 34, 35, 36, 37, 38, 39, 40'],
['41, 42, 43, 44, 45, 46, 47, 48, 49'],
['50, 51, 34, 52, 14, 35, 53, 54, 55, 56, 17, 57, 58, 59, 31'],
['60, 61, 62, 63, 64, 65, 12, 66, 14, 67, 68, 59, 31']]]
Answer 0 (score: 2)
You can do the following to preserve the structure while still indexing the words:
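One way to approach this, sketched here in plain Python rather than with Keras, is to build a single vocabulary over every string in the nested data and then encode each string in place. The helper names `build_word_index` and `encode` are hypothetical, and the sketch assumes simple whitespace splitting and lowercasing. IDs are assigned in order of first appearance starting at 0, which matches the hand-made example in the question.

```python
def build_word_index(nested):
    """Map each distinct word to an integer ID, in order of first appearance."""
    word_index = {}
    for group in nested:
        for utterance in group:
            for sentence in utterance:
                for word in sentence.lower().split():
                    if word not in word_index:
                        word_index[word] = len(word_index)
    return word_index


def encode(nested, word_index):
    """Replace every string with a list of word IDs, keeping the nesting."""
    return [
        [
            [[word_index[w] for w in sentence.lower().split()] for sentence in utterance]
            for utterance in group
        ]
        for group in nested
    ]


# Small sample in the same shape as the question's data
sample = [
    [['not'], ['ahead'], ['um let me think']],
    [['probably would go ahead']],
]

word_index = build_word_index(sample)
encoded = encode(sample, word_index)
```

Each string becomes a list of integers, so the result is one level deeper than the input. If you would rather keep the Keras Tokenizer, the same idea should apply: call fit_on_texts on a flattened list of all the strings, then call texts_to_sequences on each innermost list separately. In that case the IDs start at 1 and are ordered by word frequency, not by first appearance.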