How do I extract the text between a word and the next word?

Time: 2016-10-28 11:43:56

Tags: python regex python-2.7 python-3.x

I have the following sample text:

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
  \item Item 1
  \item Item 2
\end{itemize}
\end{document}'''

What I want to accomplish is to create a regular expression that extracts the following lines from mystr:

['This is introduction paragraph','This is non-introduction paragraph','    This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']

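2 answers:

Answer 0: (score: 2)

For whatever reason, you need to use a regular expression. Perhaps the string you split on is more involved than just "a". The re module has a split function too: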

import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"


print(re.split(r'\s?\ba\b\s?',str_))

# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
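If the delimiter is not a fixed literal, the same idea can be wrapped in a small helper. The sketch below is only an illustration: split_on_word is a hypothetical name, not part of the re module, and it assumes the delimiter consists of word characters, should match only as a whole word, and that empty pieces are unwanted.

```python
import re


def split_on_word(text, word):
    """Split text on whole-word occurrences of word (hypothetical helper)."""
    # re.escape guards against regex metacharacters in the delimiter;
    # \b keeps "a" from matching inside words such as "lazy".
    pattern = r'\s*\b' + re.escape(word) + r'\b\s*'
    # Drop the empty piece produced when the text starts or ends with the word.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_word("a quick brown fox jumps over a lazy dog than a quick elephant", "a"))
```

Filtering out empty strings is a design choice; remove the filter if the position of the delimiter at the start or end of the text matters.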

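EDIT: expanding the answer with the new information you provided...

Now that your edit gives a better description of the problem and includes text that looks like LaTeX, I think you need to extract the lines that do not start with \, since those are LaTeX commands. In other words, you need the lines that contain only text. Try the following, still using a regular expression: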

import re

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''

pattern = r"^[^\\]*\n"


matches = re.findall(pattern, mystr, flags=re.M)

print(matches)

# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
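The pattern above returns only single text lines, so it would drop the itemize block that the question wants to keep in the third result. One possible alternative, offered as a sketch rather than a definitive solution, is to capture everything after each \section{...} line up to the next \section or \end{document}. It assumes every section heading sits on its own line and that the document ends with \end{document}; the names section_pattern and bodies are illustrative.

```python
import re

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
  \item Item 1
  \item Item 2
\end{itemize}
\end{document}'''

# Capture lazily after each \section{...} line, stopping right before
# the next \section{ or the closing \end{document}.
section_pattern = r'\\section\{[^}]*\}\n(.*?)(?=\n\\section\{|\n\\end\{document\})'

# re.S lets "." span newlines, so multi-line section bodies stay whole.
bodies = re.findall(section_pattern, mystr, flags=re.S)

print(bodies)
```

The lookahead leaves the next \section line unconsumed, so it can serve as the start of the following match.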

Answer 1: (score: 0)

You can use the split method of str:

my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)

Result:

['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']

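Note: don't use str as a variable name. It is not a keyword, but it is the name of a built-in type in Python, and assigning to it shadows that type.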
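One caveat with a plain substring split is that "a " also matches at the end of longer words. The comparison below is a small sketch using a made-up sentence (not taken from the question) to show how str.split and a word-boundary re.split can differ; the variable names are illustrative.

```python
import re

sentence = "a panda ate a banana today"  # made-up input for illustration

# Substring split: "a " also matches the tail of "panda" and "banana".
plain = sentence.split("a ")

# Word-boundary split: only the standalone word "a" acts as a delimiter.
bounded = re.split(r'\ba\b\s?', sentence)

print(plain)
print(bounded)
```

For the sentence in this answer the two approaches agree apart from trailing spaces, so the plain split is fine when the input is known not to contain such words.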