Python regex: replacing a pattern in a string

Time: 2012-12-05 14:39:40

Tags: python regex

I want to replace some substrings in a string that contains wiki markup. For example, I have this string:

some other string before
; Methods
{{columns-list|3|
* [[Anomaly detection|Anomaly/outlier/change detection]]
* [[Association rule learning]]
* [[Statistical classification|Classification]]
* [[Cluster analysis]]
* [[Decision trees]]
* [[Factor analysis]]
* [[Neural Networks]]
* [[Regression analysis]]
* [[Structured data analysis (statistics)|Structured data analysis]]
* [[Sequence mining]]
* [[Text mining]]
}}

; Application domains
{{columns-list|3|
* [[Analytics]]
* [[Bioinformatics]]
* [[Business intelligence]]
* [[Data analysis]]
* [[Data warehouse]]
* [[Decision support system]]
* [[Drug Discovery]]
* [[Exploratory data analysis]]
* [[Predictive analytics]]
* [[Web mining]]
}}
some other string after

I want to replace the original substring with:

[[Anomaly detection|Anomaly/outlier/change detection]]
[[Association rule learning]]
[[Statistical classification|Classification]]
[[Cluster analysis]]
[[Decision trees]]
[[Factor analysis]]
[[Neural Networks]]
[[Regression analysis]]
[[Structured data analysis (statistics)|Structured data analysis]]
[[Sequence mining]]
[[Text mining]]
[[Analytics]]
[[Bioinformatics]]
[[Business intelligence]]
[[Data analysis]]
[[Data warehouse]]
[[Decision support system]]
[[Drug Discovery]]
[[Exploratory data analysis]]
[[Predictive analytics]]
[[Web mining]]

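I have tried a few regexes to first extract the content inside {{}}, but I always get None.

ADD: The catch is that I am only interested in the [[]] that sit inside {{}}. There are other [[]] elsewhere in the string.

So, how can I do this with re.sub? Thanks.

ADD: Current solution (ugly)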

import re

def regt(matchobj):
    # Store matchobj.group(0) somewhere else; later on, add them back to the string.
    # Next, another function will remove all the {{}} blocks.
    return ''

matches = re.sub(r'\[\[.*?\]\](?=[^{]*\}\})', regt, wiki_string2)

3 Answers:

Answer 0: (score: 0)

Try a non-greedy regex, for example: r"\{\{.*?\}\}"
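
One detail worth keeping in mind with this pattern: by default `.` does not match a newline, and the {{ }} blocks in the question span several lines. The sketch below assumes Python 3 and a shortened sample string; the names `text`, `no_flag` and `blocks` are made up for illustration. It passes re.DOTALL so the lazy match can cross line breaks:

```python
import re

# Shortened sample modeled on the question's string (illustrative only).
text = """some other string before
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
}}
some other string after"""

# Without re.DOTALL, "." stops at newlines, so a multi-line block is missed.
no_flag = re.findall(r"\{\{.*?\}\}", text)

# With re.DOTALL, the lazy ".*?" runs up to the first closing "}}".
blocks = re.findall(r"\{\{.*?\}\}", text, flags=re.DOTALL)

print(no_flag)
print(blocks)
```

This only extracts the {{ }} blocks; picking the [[ ]] links out of them is a separate step.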

Answer 1: (score: 0)

Match instead of replacing:

\[\[.*?\]\](?=[^{]*\}\})

.*? matches lazily, so it stops at the first occurrence of ]].

.* matches greedily, so it stops at the last occurrence of ]].
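
The difference is easy to see on a single line that holds two links. This is a small sketch, assuming Python 3; the sample `line` is invented for the illustration:

```python
import re

# One line with two links (made-up sample).
line = "[[Cluster analysis]] and [[Decision trees]]"

# Lazy: each match ends at the first "]]" it can reach.
lazy = re.findall(r"\[\[.*?\]\]", line)

# Greedy: the match runs on to the last "]]" on the line.
greedy = re.findall(r"\[\[.*\]\]", line)

print(lazy)    # two separate links
print(greedy)  # a single match spanning both links
```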


(?=[^{]*\}\}) is a lookahead: the match succeeds only if the [[ ]] is followed by zero or more characters other than {, and then by }}.

This is done because you want to match a [[ ]] only when it is inside {{ }}.

So this avoids situations like this:

[[xyz]]<-this would not match since { after it
{{
[[xyz]]<-this would match since it is not followed by { and it reaches }}
[[xyz]]<-this would match since it is not followed by { and it reaches }}
}}
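
To see the lookahead in action, here is a minimal sketch, assuming Python 3 and a shortened sample string; `text`, `pattern` and `inside` are names chosen for the illustration. It collects only the links that are followed by a closing }} with no { in between:

```python
import re

# Shortened sample with links both outside and inside a {{ }} block.
text = """some [[other string before]]
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
}}
some [[other string after]]"""

# A lazy [[...]] match, accepted only if "}}" follows before any "{".
pattern = re.compile(r"\[\[.*?\]\](?=[^{]*\}\})")

inside = pattern.findall(text)
print(inside)
```

The approach relies on there being no stray { between a link and the }} that closes its block, so text with nested {{ }} templates would probably need a different strategy.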

Answer 2: (score: 0)

You can try the following:

In [10]: p = r"\[\[.*?\]\]"
In [11]: s1 = '\n'.join(re.findall(p, s))

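Update: With the additional constraint (only text inside {{}} should match), you can reach the goal in two steps:

  • Select the text inside the curly braces
  • Then select the text inside the square brackets

You can do it as follows (I use a source string that also contains text in square brackets that should not match):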

In [157]: print(s)
some [[other string before]]
Methods("")
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
* [[Factor analysis]]
}}
Application("domains")
{{columns-list|3|
* [[Analytics]]
* [[Bioinformatics]]
* [[Web mining]]
}}
some [[other string after]]

In [158]: p = r"(?:\{\{)[\s\S]*?(?:\}\})"

In [159]: s1 = '\n'.join(re.findall(p, s))

In [160]: print(s1)
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
* [[Factor analysis]]
}}
{{columns-list|3|
* [[Analytics]]
* [[Bioinformatics]]
* [[Web mining]]
}}

In [161]: p1 = r"\[\[.*?\]\]"

In [162]: s2 = '\n'.join(re.findall(p1, s1))

In [163]: print(s2)
[[Cluster analysis]]
[[Decision trees]]
[[Factor analysis]]
[[Analytics]]
[[Bioinformatics]]
[[Web mining]]
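
Since the question asks for re.sub, the two steps can also be folded into one substitution with a callback. This is only a sketch, assuming Python 3 and assuming that each {{ }} block should be replaced in place by the links it contains; `BLOCK`, `LINK`, `keep_links_only` and `sample` are hypothetical names, not part of any answer above:

```python
import re

# Hypothetical helper names; the patterns mirror the two steps above.
BLOCK = re.compile(r"\{\{.*?\}\}", re.DOTALL)  # step 1: a whole {{ ... }} block
LINK = re.compile(r"\[\[.*?\]\]")              # step 2: a single [[ ... ]] link


def keep_links_only(wiki_text):
    """Replace every {{ ... }} block with the [[ ... ]] links found inside it."""
    def links_of(match):
        # match.group(0) is the full block, braces included.
        return "\n".join(LINK.findall(match.group(0)))
    return BLOCK.sub(links_of, wiki_text)


sample = """some [[other string before]]
{{columns-list|3|
* [[Cluster analysis]]
* [[Decision trees]]
}}
some [[other string after]]"""

result = keep_links_only(sample)
print(result)
```

Links outside the braces are left untouched because the callback only ever sees the text of a {{ }} block.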