Classify text by keyword

Time: 2019-04-21 13:42:51

Tags: python python-3.x

I have documents in the following format, and I want to split them into sections with Python:

Outline: 
1. Lorem Ipsum 
2. Lorem Ipsum 

Preface: 
This is sample generated words of the documents

Each document must be split into an array, for example:

[Outline: 1. Lorem Ipsum 2. Lorem Ipsum, Preface: This is sample generated words of the documents ]

or stored in separate variables, for example:

outline = segment_by_word("outline")
preface = segment_by_word("preface")

print(preface)  # This is sample generated words of the documents

1 answer:

Answer 0: (score: 0)

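I'm assuming there are only two categories, Outline and Preface. The code below adds each line to the list for its category as a tuple containing the line number and the line's text.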

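lines_by_category = {'Outline': [], 'Preface': []}
category = None

# enumerate() supplies the line number, so no manual counter is needed
for count, line in enumerate(lines):  # Assuming you know how to get to the point of reading lines
    # Use "in" rather than str.find(): find() returns -1 (truthy) when the
    # text is missing and 0 (falsy) when it is at the start of the line
    if 'Outline:' in line:
        category = 'Outline'
    elif 'Preface:' in line:
        category = 'Preface'
    if category is None:
        continue  # Skip any lines that appear before the first heading
    # Appending to the list stored in the dict updates the dict entry in place
    lines_by_category[category].append((count, line))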
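
To get the exact output asked for in the question (one string per section), the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the headings are known in advance, each heading sits at the start of its own line followed by a colon, and the names `segment_by_heading`, `HEADINGS`, and `sample` are made up for illustration.

```python
import re

# Assumed set of section headings; extend as needed.
HEADINGS = ("Outline", "Preface")


def segment_by_heading(text, headings=HEADINGS):
    """Return {heading: body}, with each body's lines joined by single spaces."""
    # Match a line that starts with one of the headings followed by a colon,
    # capturing any text that follows the colon on the same line.
    pattern = re.compile(
        r"^\s*(%s)\s*:\s*(.*)$" % "|".join(map(re.escape, headings)),
        re.IGNORECASE,
    )
    canonical = {h.lower(): h for h in headings}
    sections = {}
    current = None
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            current = canonical[match.group(1).lower()]
            parts = sections.setdefault(current, [])
            rest = match.group(2).strip()
            if rest:
                parts.append(rest)
        elif current is not None and line.strip():
            # Non-empty line belonging to the most recent heading
            sections[current].append(line.strip())
    return {name: " ".join(parts) for name, parts in sections.items()}


sample = """Outline: 
1. Lorem Ipsum 
2. Lorem Ipsum 

Preface: 
This is sample generated words of the documents
"""

sections = segment_by_heading(sample)
outline = sections["Outline"]
preface = sections["Preface"]
print(preface)  # This is sample generated words of the documents

# The array form from the question
as_list = [f"{name}: {body}" for name, body in sections.items()]
```

Lines that appear before the first heading are ignored, and a heading that occurs more than once has its lines accumulated under the same key.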