I have the following string (stage 1):
(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)
From this I get to (stage 2):
(Undergraduate level PHYS 218 Minimum Grade of D) and (Undergraduate level MATH 152 Minimum Grade of D or Undergraduate level MATH 172 Minimum Grade of D or Undergraduate level MATH 251 Minimum Grade of D)
And what I ultimately want is (stage 3):
(PHYS 218) and (MATH 152 or MATH 172 or MATH 251)
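The way I'm doing this at the moment is horrible.
I take the stage 1 string, strip out all the a tags entirely, and join the text that is left.
Then I pull the course numbers out of the a tags and put them back into the string from the previous step to reach stage 2.
Then, in stage 2, I find each course and delete everything to its left and right until I hit (, ), or, or and.
Is there a way to do this cleanly with a regular expression or something else? Thanks.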
Answer 0: (score: 0)
x="""(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
import re
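# Remove the tags, the "Undergraduate level" prefix, and the "Minimum Grade of X" suffix.
print(re.sub(r"<[^>]*>\s*|Undergraduate level\s*|Minimum Grade of [A-Z]+", "", x))
# Output: (PHYS 218) and (MATH 152 or MATH 172 or MATH 251)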
If the format is always fixed and does not vary much, you can do this with re.sub.
See the demo.
https://regex101.com/r/hF7zZ1/2
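To make the connection to the three stages in the question explicit, the same idea can be split into two substitutions. This is only a sketch under the assumption that the wording "Undergraduate level" and "Minimum Grade of X" stays fixed; the names STAGE1, strip_tags, and strip_boilerplate are made up for illustration.

```python
import re

# Stage 1 input, copied from the question.
STAGE1 = (
    '(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or '
    'Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)'
)


def strip_tags(text):
    """Stage 1 -> stage 2: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]*>", "", text)


def strip_boilerplate(text):
    """Stage 2 -> stage 3: drop the fixed phrases around each course.

    Assumes the phrases never change; different wording is left untouched.
    """
    return re.sub(r"Undergraduate level\s*|\s*Minimum Grade of [A-Z]+", "", text)


stage2 = strip_tags(STAGE1)
stage3 = strip_boilerplate(stage2)
print(stage2)
print(stage3)
```

Keeping the two steps separate makes it easier to inspect the intermediate string when the input format drifts.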
Edit:
If the surrounding text varies, try this instead:
x="""(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and (Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)"""
import re
# Keep only the parentheses, the and/or connectives, and the text inside each <A> tag.
print("".join(re.findall(r"(\(|\)|\s*or\s*|\s*and\s*|(?<=>)[^<]*(?=<\/A>))", x)))
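One caveat with the findall pattern is that or and and are matched anywhere, including inside longer words such as "Minor" or "Standard". A possible variation, sketched here as an assumption rather than a tested drop-in, tokenizes the string with named groups and word boundaries; TOKEN, SAMPLE, and simplify are hypothetical names.

```python
import re

# Sample input, copied from the question.
SAMPLE = (
    '(Undergraduate level <A HREF="blah=">PHYS 218</A> Minimum Grade of D) and '
    '(Undergraduate level <A HREF="blah">MATH 152</A> Minimum Grade of D or '
    'Undergraduate level <A HREF="/blah=">MATH 172</A> Minimum Grade of D or '
    'Undergraduate level <A HREF="blah">MATH 251</A> Minimum Grade of D)'
)

# One alternative per kind of token we want to keep; everything else is skipped.
TOKEN = re.compile(
    r"""
    (?P<paren>[()])                     # grouping parentheses
    | <a\b[^>]*>(?P<course>[^<]*)</a>   # text inside an anchor tag
    | \b(?P<op>and|or)\b                # connective as a whole word only
    """,
    re.IGNORECASE | re.VERBOSE,
)


def simplify(text):
    """Rebuild the expression from parentheses, courses, and connectives."""
    parts = []
    for match in TOKEN.finditer(text):
        if match.group("paren"):
            parts.append(match.group("paren"))
        elif match.group("course") is not None:
            parts.append(match.group("course").strip())
        else:
            parts.append(" " + match.group("op").lower() + " ")
    return "".join(parts)


print(simplify(SAMPLE))
```

Because the anchor alternative consumes the whole tag, parentheses or connectives inside an HREF value are not picked up. A standalone "or" in the surrounding prose (for example "Grade of C or better") would still be kept, so this assumes the boilerplate contains no such words.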