I use this code to load a review dataset from a pickle file:
from pickle import load

stories = load(open('summarization/review_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
print(type(stories))
I get this result:
Loaded Stories 568411
<class 'list'>
I vectorize the data by running this code:
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
for story in stories:
    input_text = story['story']
    for highlight in story['highlights']:
        target_text = highlight
    # use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
I get this result:
Number of samples: 568411
Number of unique input tokens: 18
Number of unique output tokens: 3
Max sequence length for inputs: 15074
Max sequence length for outputs: 5
instead of this expected result:
Number of samples: 568411
Number of unique input tokens: 84
Number of unique output tokens: 48
Max sequence length for inputs: 15074
Max sequence length for outputs: 5
The program does not report any errors. Can anyone help?
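In case it helps with diagnosing this, below is a minimal sketch of how the loaded records could be inspected before vectorizing. It only assumes what the loop above already assumes, namely that each record is a dict with 'story' and 'highlights' keys. The helper names summarize_records and char_vocabulary are hypothetical, and the sample records are made up for illustration; they are not taken from review_dataset.pkl.

```python
from collections import Counter


def summarize_records(records, limit=3):
    """Return (story type, highlights type, highlights preview) for the first records."""
    rows = []
    for record in records[:limit]:
        story = record['story']
        highlights = record['highlights']
        rows.append((type(story).__name__,
                     type(highlights).__name__,
                     repr(highlights)[:60]))
    return rows


def char_vocabulary(texts):
    """Count how often each character occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


# Made-up records, only to show the shape of the output (an assumption,
# not the real contents of the pickle file).
sample = [
    {'story': 'great coffee', 'highlights': ['good']},
    {'story': 'too sweet', 'highlights': ['bad', 'sweet']},
]

for row in summarize_records(sample):
    print(row)
print(sorted(char_vocabulary(r['story'] for r in sample)))
```

Running summarize_records(stories) and char_vocabulary(s['story'] for s in stories) on the real data should show whether the two fields hold raw text or something already preprocessed, which would be one possible reason for the small token counts.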