Text segmentation based on direct sentences

Time: 2018-09-02 06:19:06

Tags: python regex text-segmentation

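Suppose I have a docx file like this:

When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"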

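I want to segment the docx based on direct sentences. Like this:

sent1: He said, "Son when you grow up would you be the savior of the broken?"

sent2: I said "I Would."

sent3: My father replied "That is my boy!"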

I tried using a regular expression. This is the result:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?

".

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would.

".

My father replied "That is my boy!

Regex code:

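import re

# A sentence is a run of non-terminators followed by one of ! ? .
SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')

# Read the contents into a string; findall() cannot take a file object.
# This assumes the file holds plain text. A real .docx is a zip archive
# and needs a docx reader to extract its text first.
with open('text.docx', 'r') as f:
    text = f.read()

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))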

1 Answer:

Answer 0 (score: 0)

You can use re.sub to insert a blank line after each sentence-ending punctuation mark, including an optional closing quote:

import re

txt = '''When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"'''

pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')

new = re.sub(pttrn, r'\1\2\n\n', txt)

print(new)

Output:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?"

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would."

My father replied "That is my boy!"

PS: As far as I know, endings such as `?".` or `.".` are not used in English.
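The snippet above splits every sentence, whereas the question asks only for the sentences that contain direct speech. A minimal sketch of one way to get there, assuming straight double quotes mark direct speech and that no sentence boundary falls inside a quotation: split on the same kind of boundary, then keep only the sentences that contain a quote. The names `split_sentences` and `direct_sentences` are hypothetical helpers introduced here, not part of the original answer.

```python
import re

# Boundary: whitespace that follows ., ? or !, optionally with a closing
# quote between the punctuation mark and the whitespace.
# (Two separate lookbehinds, because Python requires each to be fixed-width.)
BOUNDARY = re.compile(r'(?<=[.?!]["\'])\s+|(?<=[.?!])\s+')


def split_sentences(text):
    """Split text into sentences at the boundaries defined above."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


def direct_sentences(text):
    """Keep only sentences containing a double quote (assumed direct speech)."""
    return [s for s in split_sentences(text) if '"' in s]


txt = (
    'When I was a young boy my father took me into the city to see a '
    'marching band. He said, "Son when you grow up would you be the savior '
    'of the broken?" My father sat beside me, hugging my shoulders with both '
    'of his arms. I said "I Would." My father replied "That is my boy!"'
)

if __name__ == "__main__":
    for i, sentence in enumerate(direct_sentences(txt), start=1):
        print(f"sent{i}: {sentence}")
```

This is only a heuristic: abbreviations such as "Mr." and quotations that span several sentences would be split in the wrong place, so real text may need a proper sentence tokenizer.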