Python regex: combining text outside tags with the text between tags

Time: 2015-07-21 08:57:55

Tags: python regex

I have the following string (stage 1):

(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)

From this I get to (stage 2):

(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)

And what I ultimately want is (stage 3):

(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)

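The way I do this at the moment is horrible.

I take the stage 1 string, remove all the A tags completely, and join the text that is left.

Then I take the course numbers out of the A tags and put them into the string from the previous step, which gets me to stage 2.

Then I look for the courses in stage 2 and delete everything to their left and right until I hit "(", ")", "or", or "and".

Is there a way to do this cleanly with a regex or something else? Thanks.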

1 answer:

Answer 0 (score: 0)

import re

x = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Remove every tag, the "Undergraduate level" prefix, and the "Minimum Grade of X" suffix.
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+", "", x))

If the format is always fixed and does not vary much, you can do this with re.sub.

See the demo.

https://regex101.com/r/hF7zZ1/2
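The single re.sub call above can also be split into the two stages the question describes. This is only a sketch: the function names `strip_tags` and `strip_boilerplate` are made up for illustration, the sample string is a shortened version of the one in the question, and it assumes the surrounding phrases are always "Undergraduate level" and "Minimum Grade of" followed by a letter grade.

```python
import re

# Shortened sample in the same shape as the string from the question.
STAGE1 = (
    '(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D)'
)


def strip_tags(text):
    """Stage 1 -> stage 2: drop the tags, then collapse runs of spaces."""
    without_tags = re.sub(r"<[^>]*>", "", text)
    return re.sub(r" {2,}", " ", without_tags)


def strip_boilerplate(text):
    """Stage 2 -> stage 3: drop the fixed phrases around each course."""
    return re.sub(r"Undergraduate level\s*|\s*Minimum Grade of [A-Z]+", "", text)


stage2 = strip_tags(STAGE1)
stage3 = strip_boilerplate(stage2)
print(stage2)
print(stage3)  # (PHYS 218) and (MATH 152 or MATH 172)
```

Keeping the stages separate makes it easier to inspect the intermediate string when the input format drifts, at the cost of an extra pass over the text.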

Edit:

If the surrounding text varies, try this:

import re

x = """(Undergraduate level  <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level  <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level  <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level  <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
# Keep only the parentheses, the and/or connectives, and the text inside each <A>...</A> pair.
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=</A>))", x)))
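One caveat with the findall pattern above: `\s*or\s*` and `\s*and\s*` also match inside longer words such as "Honors" or "standing" if the surrounding text ever contains them. A possible variant, sketched here under the assumption that the connectives are always lowercase and surrounded by whitespace and that the closing tag is always written `</A>`, requires whitespace on both sides. The names `TOKEN` and `extract_courses` and the sample string are hypothetical.

```python
import re

TOKEN = re.compile(
    r"""
    [()]                      # grouping parentheses
    | \s+(?:and|or)\s+        # standalone connectives, keeping their spaces
    | (?<=>)[^<]+(?=</A>)     # text between an opening tag and </A>
    """,
    re.VERBOSE,
)


def extract_courses(text):
    """Join the parentheses, connectives, and tag contents, dropping the rest."""
    return "".join(TOKEN.findall(text))


# Hypothetical input whose filler text contains "or" and "and" inside words.
sample = (
    '(Honors level  <A HREF="a">PHYS 218</A> Minor standing) and '
    '(Graduate level  <A HREF="b">MATH 152</A> Minimum Grade of D or '
    'Graduate level  <A HREF="c">MATH 172</A> Minimum Grade of D)'
)
print(extract_courses(sample))  # (PHYS 218) and (MATH 152 or MATH 172)
```

This still picks up a free-standing "or" or "and" in the filler text (for example "Grade of C or better"), so it is a mitigation rather than a full fix.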