Reading multiple labels as a list or tuple in a dict keyed by id, i.e. {id: (cat1, cat2, ...)}

Time: 2018-08-02 00:32:53

Tags: python-3.x dictionary machine-learning text-analysis multilabel-classification

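I am building a multi-label text classification model. Below is an excerpt from my labels.txt file. I want to convert these records into a dictionary that maps each id to a tuple or list of its categories, i.e. {id: (cat1, cat2)}. The records are not separated by blank lines. I am confused about how to convert this kind of data into a dictionary.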

B0027DQHA0
  Movies & TV, TV
  Music, Classical
0756400120
  Books, Literature & Fiction, Anthologies & Literary Collections, General
  Books, Literature & Fiction, United States
  Books, Science Fiction & Fantasy, Science Fiction, Anthologies
  Books, Science Fiction & Fantasy, Science Fiction, Short Stories
B0000012D5
  Music, Blues
  Music, Pop
  Music, R&B

1 answer:

Answer 0 (score: 1)

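If the category lines are always indented with spaces and the IDs are not, you can use the indentation to tell them apart. Loop over the lines and append each category line to a list stored in a dictionary under the most recently seen ID: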

r = '''B0027DQHA0
  Movies & TV, TV
  Music, Classical
0756400120
  Books, Literature & Fiction, Anthologies & Literary Collections, General
  Books, Literature & Fiction, United States
  Books, Science Fiction & Fantasy, Science Fiction, Anthologies
  Books, Science Fiction & Fantasy, Science Fiction, Short Stories
B0000012D5
  Music, Blues
  Music, Pop
  Music, R&B'''
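d = {}
for l in r.splitlines():
    if l.startswith(' '):
        # Indented line: a category belonging to the current ID.
        d.setdefault(i, []).append(l.lstrip())
    else:
        # Unindented line: a new ID; later categories attach to it.
        i = l
print(d)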

This outputs:

{'B0027DQHA0': ['Movies & TV, TV', 'Music, Classical'], '0756400120': ['Books, Literature & Fiction, Anthologies & Literary Collections, General', 'Books, Literature & Fiction, United States', 'Books, Science Fiction & Fantasy, Science Fiction, Anthologies', 'Books, Science Fiction & Fantasy, Science Fiction, Short Stories'], 'B0000012D5': ['Music, Blues', 'Music, Pop', 'Music, R&B']}
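
Since the question asks for tuples and reads from a file, the same idea can be wrapped in a function. The following is only a minimal sketch: the name `parse_labels` is hypothetical, and it assumes that every category line starts with whitespace, that every ID line does not, and that each whole category line counts as one label. It also skips blank lines and raises an error if a category line appears before any ID.

```python
import io


def parse_labels(lines):
    """Map each ID to a tuple of its category lines (hypothetical helper)."""
    labels = {}
    current_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: a category for the most recent ID.
            if current_id is None:
                raise ValueError("category line appears before any ID")
            labels[current_id].append(line.strip())
        else:
            # Unindented line: start collecting categories for a new ID.
            current_id = line.strip()
            labels.setdefault(current_id, [])
    # Freeze the lists into tuples, as the question requested.
    return {key: tuple(cats) for key, cats in labels.items()}


# Small in-memory sample standing in for labels.txt.
sample = io.StringIO(
    "B0027DQHA0\n"
    "  Movies & TV, TV\n"
    "  Music, Classical\n"
    "B0000012D5\n"
    "  Music, Blues\n"
)
result = parse_labels(sample)
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `with open('labels.txt', encoding='utf-8') as f: labels = parse_labels(f)`, where the encoding is an assumption about the file.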