My sentences are stored in a text file like this:
radiologicalreport =1. MDCT OF THE CHEST History: A 58-year-old male, known case lung s/p LUL segmentectomy. Technique: Plain and enhanced-MPR CT chest is performed using 2 mm interval. Previous study: 03/03/2018 (other hospital) Findings: Lung parenchyma: The study reveals evidence of apicoposterior segmentectomy of LUL showing soft tissue thickening adjacent surgical bed at LUL, possibly post operation.
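My ultimate goal is to apply LDA to assign each sentence to a topic. Before that, I want to one-hot encode the text. The problem is that I want each sentence one-hot encoded in a numpy array so that I can feed it to LDA. If I wanted to one-hot encode the whole text, I could do it easily with these two lines: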
sent_text = nltk.sent_tokenize(text)
hot_encode=pd.Series(sent_text).str.get_dummies(' ')
However, my goal is to one-hot encode each sentence in a numpy array, so I tried the following code:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import nltk
import pandas as pd
from nltk.tokenize import TweetTokenizer, sent_tokenize
with open('radiologicalreport.txt', 'r') as myfile:
    report = myfile.read().replace('\n', '')
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in
                    nltk.sent_tokenize(report)]
tokens_np = array(tokens_sentences)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(tokens_np)
# binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
I get an error on this line: "TypeError: unhashable type: 'list'"
integer_encoded = label_encoder.fit_transform(tokens_np)
So I can't get any further. For reference, tokens_sentences is a list of token lists, one per sentence.
Please help!
Answer 0 (score: 1)
You are trying to use fit_transform to convert labels to numeric values (in your example, the labels are lists of words: tokens_sentences).
But non-numeric labels can only be converted if they are hashable and comparable (see the docs). Lists are not hashable, but you can convert them to tuples:
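# dtype=object keeps each tuple as a single element; recent NumPy versions
# refuse to build an array from tuples of different lengths without it
tokens_np = array([tuple(s) for s in tokens_sentences], dtype=object)
# also ok: tokens_np = [tuple(s) for s in tokens_sentences]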
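The difference between the two types is easy to check in plain Python. The snippet below is only an illustration: `tokens` is a made-up token list, not taken from the report. Hashing a list raises the same TypeError as in the question, while the equivalent tuple hashes fine and can be stored in a set.

```python
# Hypothetical token list standing in for one tokenized sentence.
tokens = ["MDCT", "OF", "THE", "CHEST"]

try:
    hash(tokens)  # lists are mutable, so Python refuses to hash them
    list_error = None
except TypeError as exc:
    list_error = str(exc)

as_tuple = tuple(tokens)  # immutable copy of the same tokens
tuple_is_hashable = isinstance(hash(as_tuple), int)

# Equal tuples collapse to a single set element, which is what makes
# "find the unique labels" possible for tuples but not for lists.
unique_sentences = {as_tuple, tuple(tokens)}

print(list_error)             # unhashable type: 'list'
print(tuple_is_hashable)      # True
print(len(unique_sentences))  # 1
```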
Then you can encode the sentences as integer_encoded:
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(tokens_np)
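Note that encoding whole sentences this way yields one integer per distinct sentence, not a vector per word. If what is wanted is the per-sentence word-indicator row described in the question, a minimal sketch in plain Python might look like the following. The sample sentences and the helper name `encode_sentences` are hypothetical, and whitespace splitting stands in for the TweetTokenizer used in the question.

```python
def encode_sentences(tokens_sentences):
    """Return (vocabulary, matrix) with one 0/1 row per sentence."""
    # A sorted vocabulary over all sentences gives every word a fixed column.
    vocabulary = sorted({tok for sent in tokens_sentences for tok in sent})
    column = {tok: i for i, tok in enumerate(vocabulary)}

    matrix = []
    for sent in tokens_sentences:
        row = [0] * len(vocabulary)
        for tok in sent:
            row[column[tok]] = 1  # mark presence, ignoring repeat counts
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical, already-split sentences (not taken from the report).
sample = [
    "soft tissue thickening".split(),
    "no soft tissue mass".split(),
]
vocabulary, matrix = encode_sentences(sample)
print(vocabulary)  # ['mass', 'no', 'soft', 'thickening', 'tissue']
print(matrix)      # [[0, 0, 1, 1, 1], [1, 1, 1, 0, 1]]
```

Passing `matrix` to `numpy.array` turns the rows into a 2-D array with one row per sentence and one column per vocabulary word.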