Regular expressions with Unicode characters

Time: 2016-08-26 15:48:26

Tags: python regex unicode

I am trying to use a regular expression on text that contains some special characters, such as à, è, ù, etc.

filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)

The following text is the result of evaluating that expression.

  [[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]], [[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. Edition. Vol. 1: Briefe der Jahre 1764-1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]

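Now, when I try to use another regular expression on the text above to extract the words inside the double square brackets, the result is wrong. Every name that contains a special character such as à, ù or è is cut off at that character, so the result is not what I expect.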

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)
another_filtered_list = re.findall(another_compiled, (str(filter_list)))

These are my results:

  

[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle ', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('GWF Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]

These are the results I want to achieve:

  

  Match 1: 1. [2-28] Pedro Calderón de la Barca
  Match 2: 1. [43-72] Christian Fürchtegott Gellert
  Match 3: 1. [86-102] Oliver Goldsmith
  Match 4: 1. [118-123] Hafez
  Match 5: 1. [129-152] Johann Gottfried Herder
  Match 6: 1. [165-170] Homer
  Match 7: 1. [176-184] Kālidāsa
  Match 8: 1. [190-194] Kant
  Match 9: 1. [200-228] Friedrich Gottlieb Klopstock
  Match 10: 1. [244-268] Gotthold Ephraim Lessing
  Match 11: 1. [282-295] Carl Linnaeus
  Match 12: 1. [310-326] James Macpherson
  Match 13: 1. [343-364] Jean-Jacques Rousseau
  Match 14: 1. [379-397] Friedrich Schiller
  Match 15: 1. [412-431] William Shakespeare
  Match 16: 1. [449-456] Spinoza
  Match 17: 1. [462-480] Emanuel Swedenborg
  Match 18: 1. [501-522] Karl Robert Mandelkow
  Match 19: 1. [659-685] Johann Joachim Winckelmann

All of the regular expressions were tested online and work fine there. Is there a way to actually include these special characters?

1 answer:

Answer 0: (score: 2)

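In Python 3, the regular expression does not compile, because the ur'' string prefix is not valid syntax there. It seems to work for me when I change: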

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'

to just a unicode (non-raw) string:

filter_6 = u'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
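For Python 3 specifically, a plain raw string (r'...') is usually enough, since every str is already Unicode and \w matches Unicode word characters by default. The sketch below is a hypothetical variant of filter_6, not the pattern from the question: it assumes the intent of the second lookahead was a literal pipe, so the pipe is escaped and the two lookaheads are merged, which makes findall return plain strings instead of tuples. The sample text is a shortened excerpt chosen for illustration.

```python
import re

# Hypothetical variant of filter_6 (an assumption, not the original pattern):
# raw string without the u prefix, pipe escaped, lookaheads merged into one.
filter_6 = r'(?<=\[\[)([\w\s.-]+)(?=\]\]|\|)'
compiled = re.compile(filter_6)

# Shortened sample in the same [[Name|Label]] format as the question's text.
sample = '[[Christian Fürchtegott Gellert|Gellert]], [[Homer]], [[Kālidāsa]]'

# With a single capturing group, findall returns a list of strings.
names = compiled.findall(sample)
print(names)
```

With only one capturing group there is no empty second element to discard afterwards.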

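In Python 2, I think the problem is converting the list to a string. Calling str() on a list produces the repr of each element, so non-ASCII characters show up as escape sequences such as \xf3, and the match stops at the backslash. Changing str(filter_list) to ' '.join(filter_list) seems to work for me.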
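The difference between the two conversions can be reproduced with a minimal sketch. It is written for Python 3 and uses ascii() as a stand-in for what str() does to a list of unicode strings in Python 2 (both escape non-ASCII characters); that substitution is an assumption made for illustration. The pattern is the one from the question, rewritten as a raw string, and the short filter_list is made up.

```python
import re

# filter_6 from the question, as a raw string so it also runs on Python 3.
pattern = re.compile(r'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))', flags=re.U | re.M)

# Made-up, shortened stand-in for the question's filter_list.
filter_list = ['[[Pedro Calderón de la Barca|Calderón]], [[Hafez]], [[Kālidāsa]]']

# ascii() escapes non-ASCII characters (ó becomes \xf3), which mimics
# str(filter_list) on a list of unicode strings in Python 2.
escaped = ascii(filter_list)

# Joining keeps the real characters in the string.
joined = ' '.join(filter_list)

# findall returns (name, '') tuples because the pattern has two groups.
truncated = [m[0] for m in pattern.findall(escaped)]
complete = [m[0] for m in pattern.findall(joined)]

print(truncated)
print(complete)
```

The escaped version cuts each name at the first backslash, matching the truncated results in the question, while the joined version keeps the accented names whole.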