I am trying to use regular expressions on text that contains some special characters, such as à, è, ù, etc.
filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)
The following text is the result of evaluating that expression:
[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]], [[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. Edition. Vol. 1: Briefe der Jahre 1764-1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]
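Now, when I try to use another regular expression on the text above to extract the words inside the square brackets, the result is wrong. Every word containing a special character such as à, ù, or è is cut off, and the output is not what I expect.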
filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)
another_filtered_list = re.findall(another_compiled, (str(filter_list)))
These are my results:
[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle ', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('GWF Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]
These are the results I want to achieve:
Match 1 1. [2-28] Pedro Calderón de la Barca
Match 2 1. [43-72] Christian Fürchtegott Gellert
Match 3 1. [86-102] Oliver Goldsmith
Match 4 1. [118-123] Hafez
Match 5 1. [129-152] Johann Gottfried Herder
Match 6 1. [165-170] Homer
Match 7 1. [176-184] Kālidāsa
Match 8 1. [190-194] Kant
Match 9 1. [200-228] Friedrich Gottlieb Klopstock
Match 10 1. [244-268] Gotthold Ephraim Lessing
Match 11 1. [282-295] Carl Linnaeus
Match 12 1. [310-326] James Macpherson
Match 13 1. [343-364] Jean-Jacques Rousseau
Match 14 1. [379-397] Friedrich Schiller
Match 15 1. [412-431] William Shakespeare
Match 16 1. [449-456] Spinoza
Match 17 1. [462-480] Emanuel Swedenborg
Match 18 1. [501-522] Karl Robert Mandelkow
Match 19 1. [659-685] Johann Joachim Winckelmann
All of the regular expressions were tested online and worked fine there. Is there a way to actually include these special characters?
Answer 0 (score: 2)
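In Python 3, the code does not compile, because the ur'' string prefix is no longer valid syntax. It seemed to work for me when I changed: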
filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
to just a unicode (non-raw) string:
filter_6 = u'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
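For reference, every str in Python 3 is already Unicode, so a plain raw string (r'...') should work as well, and it avoids the warnings that recent Python 3 releases emit for escapes such as \[ in a non-raw literal. A minimal sketch, assuming Python 3; the sample text is a shortened, hypothetical stand-in for the question's input, not the full text:

```python
import re

# Same pattern as filter_6, written as a raw string (Python 3).
filter_6 = r'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)

# Shortened stand-in for the question's text (assumption, not the full input).
sample = '[[Pedro Calderón de la Barca|Calderón]], [[Kālidāsa]], [[Kant]]'

# findall returns (name, '') tuples because the pattern has two groups;
# keep only the first group of each match.
names = [m[0] for m in another_compiled.findall(sample)]
print(names)
```

The second element of each tuple is always empty because both alternatives in the last group are zero-width lookaheads, which is why only the first group is kept here.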
In Python 2, I think the problem is the conversion of the list to a string. Changing str(filter_list) to ' '.join(filter_list) seems to work for me.
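The likely reason is that, in Python 2, str() on a list returns the repr of each element, so a unicode item such as u'Calderón' is rendered with an escape sequence (\xf3) instead of the character itself, and \w stops at the backslash. A minimal sketch of that effect, assuming Python 3: the built-in ascii() is used here only as a stand-in for what Python 2's str() did to a list, the pattern is simplified to a single group, and filter_list is a hypothetical one-item example:

```python
import re

# Simplified single-group version of the question's second pattern.
pattern = re.compile(r'(?<=\[\[)([\w\s.-]+)', flags=re.U)

# Hypothetical one-item list standing in for the question's filter_list.
filter_list = ['[[Pedro Calderón de la Barca|Calderón]], [[Kālidāsa]]']

# ascii() escapes non-ASCII characters, roughly like Python 2's str(list) did.
escaped = ascii(filter_list)
truncated = pattern.findall(escaped)

# Joining the items keeps the real characters, so \w can match them.
joined = ' '.join(filter_list)
complete = pattern.findall(joined)

print(truncated)  # names cut off at the first escaped character
print(complete)   # full names
```

With the escaped text the names are cut at the first backslash, which matches the truncated results shown in the question; with the joined text the full names are returned.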