Question

I have documents in the following format, and I want to split them into sections using Python:

Outline: 
1. Lorem Ipsum 
2. Lorem Ipsum 

Preface: 
This is sample generated words of the documents

The sections must be collected into an array, for example:

[Outline: 1. Lorem Ipsum 2. Lorem Ipsum, Preface: This is sample generated words of the documents ]

or stored in separate variables, for example:

outline = segment_by_word("outline")
preface = segment_by_word("preface")

print(preface)  # This is sample generated words of the documents

Answer 1

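I'm assuming there are only two categories, Outline and Preface. The code below appends each line to the list for its category as a tuple containing the line number and the line text.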

lines_by_category = {'Outline': [], 'Preface': []}
category = None

# Assuming you know how to get to the point of reading lines
for count, line in enumerate(lines):
    # str.find() returns -1 (truthy) when nothing is found, so test with `in`
    if 'Outline:' in line:
        category = 'Outline'
    elif 'Preface:' in line:
        category = 'Preface'
    if category is None:
        continue  # Skip any lines that come before the first heading
    # Appending through the dict updates the stored list in place
    lines_by_category[category].append((count, line))
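
To get closer to the output the question asks for, the same idea can be wrapped in a function that returns the text of each section. This is only a minimal sketch: the name `segment_by_heading` is hypothetical (it is not from the question or any library), the `SAMPLE` string just copies the example document, and it assumes every heading sits at the start of a line and ends with a colon.

```python
import re

# Copy of the example document from the question
SAMPLE = """Outline:
1. Lorem Ipsum
2. Lorem Ipsum

Preface:
This is sample generated words of the documents
"""


def segment_by_heading(text, headings=("Outline", "Preface")):
    """Return {heading: body}, with the body lines joined by single spaces."""
    sections = {name: [] for name in headings}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # Ignore blank lines between sections
        # A heading line looks like "Outline:", optionally followed by text
        match = re.match(r"(\w+):\s*(.*)", line)
        if match and match.group(1) in sections:
            current = match.group(1)
            line = match.group(2)  # Keep any text that follows the heading
        if current is not None and line:
            sections[current].append(line)
    return {name: " ".join(body) for name, body in sections.items()}


sections = segment_by_heading(SAMPLE)
print(sections["Preface"])  # This is sample generated words of the documents
print([f"{name}: {body}" for name, body in sections.items()])
```

Returning a dict keeps both of the requested forms within reach: `sections["Outline"]` and `sections["Preface"]` serve as the separate variables, and the list comprehension on the last line builds the array form.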
