Replace spaces inside marked phrases with underscores

Time: 2018-03-04 16:27:24

Tags: python regex

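I have a text file in which important phrases are marked with special symbols. To be precise, each one starts with <highlight> and ends with <\highlight>.

For example,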

"<highlight>machine learning<\highlight> is gaining more popularity, so do <highlight>block chain<\highlight>."

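In this sentence, the important phrases are delimited by <highlight> and <\highlight>.

I need to remove <highlight> and <\highlight>, and replace the spaces between the words they enclose with underscores. That is, convert "<highlight>machine learning<\highlight>" to "machine_learning". The whole sentence after processing should be "machine_learning is gaining more popularity, so do block_chain."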

2 answers:

Answer 0: (score: 1)

Try this:

>>> import re
>>> text = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."
>>> re.sub(r"<highlight>(.*?)<\\highlight>", lambda x: x.group(1).replace(" ", "_"), text)
'machine_learning is gaining more popularity, so do block_chain.'
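
The same idea can be wrapped in a reusable function. The following is a minimal sketch, not a definitive implementation: the names HIGHLIGHT and underscore_highlights are hypothetical, and it assumes that phrases may span line breaks or contain runs of whitespace, which the question does not state.

```python
import re

# Hypothetical names; the tag spelling (with a backslash) follows the question.
# re.DOTALL is an assumption: it lets a tagged phrase span a line break.
HIGHLIGHT = re.compile(r"<highlight>(.*?)<\\highlight>", re.DOTALL)


def underscore_highlights(text):
    """Drop the highlight tags and join each tagged phrase's words with underscores."""
    def join_words(match):
        # split() with no arguments splits on any whitespace run and trims the ends.
        return "_".join(match.group(1).split())
    return HIGHLIGHT.sub(join_words, text)


sample = ("<highlight>machine learning<\\highlight> is gaining more popularity, "
          "so do <highlight>block chain<\\highlight>.")
print(underscore_highlights(sample))
```

Compiling the pattern once is mostly a readability choice here; the non-greedy (.*?) is what keeps two tagged phrases on one line from being merged into a single match.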

Answer 1: (score: -1)

Here you go:

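import re

txt = "<highlight>machine learning<\\highlight> is gaining more popularity, so do <highlight>block chain<\\highlight>."

# Raw string: \\ in the pattern matches the single literal backslash in the closing tag.
words = re.findall(r'<highlight>(.*?)<\\highlight>', txt)
for w in words:
    # Swap spaces for underscores in each captured phrase.
    txt = txt.replace(w, w.replace(' ', '_'))
# Strip the tags themselves.
txt = txt.replace('<highlight>', '')
txt = txt.replace('<\\highlight>', '')
print(txt)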
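
One caveat with a find-then-replace loop of this kind: str.replace on the bare phrase also rewrites any untagged copy of the same phrase elsewhere in the text. The sketch below compares the two approaches on a made-up input (the sample sentence and the variable names are assumptions for illustration, not from the question).

```python
import re

# Made-up input: the phrase appears once untagged and once tagged.
text = ("machine learning matters; "
        "<highlight>machine learning<\\highlight> is gaining more popularity.")

# Approach A: find the phrases, then replace the bare phrase everywhere.
loop_result = text
for phrase in re.findall(r"<highlight>(.*?)<\\highlight>", text):
    loop_result = loop_result.replace(phrase, phrase.replace(" ", "_"))
loop_result = loop_result.replace("<highlight>", "").replace("<\\highlight>", "")

# Approach B: rewrite only the tagged spans in a single pass.
sub_result = re.sub(
    r"<highlight>(.*?)<\\highlight>",
    lambda m: m.group(1).replace(" ", "_"),
    text,
)

print(loop_result)  # the untagged "machine learning" is changed too
print(sub_result)   # only the tagged phrase is changed
```

If untagged occurrences must stay as they are, the single-pass re.sub form is the safer choice.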