Question

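I use Wikipedia's API to fetch a simple JSON object containing the first paragraph of a wiki page, which I later want to read to the user with text-to-speech. However, some articles include a proper phonetic transcription of the pronunciation. For example, when I follow the Using XSLT within ASP link, the text in the JSON comes out with the transcription included. The regex I tried is:

re.sub("\/.+\/", "", test)

My question is: which regular expression would remove the pronunciation part (and possibly any special Unicode characters, that is, \u followed by 4 characters)?

Trying to match \ only adds another \ after the existing one.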

Answer 1

(Assuming you are using Python, since you used re.sub, and that you only want to remove /tʃɪˈwɑːwɑː/, judging by your example regex.)

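First, you should use Python's raw string notation for regex patterns, because Python uses the backslash for other things in ordinary string literals (source). Putting r in front of the regex's string literal may be enough to make your original example work.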
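To illustrate why the r prefix matters, here is a minimal sketch in Python 3. The pattern and sample sentence are made up for illustration and are not from the question. It uses \b, which is a backspace character in an ordinary string literal but a word boundary once the regex engine sees it.

```python
import re

# In a normal literal, "\b" is a single backspace character.
# In a raw literal, it stays as backslash + "b", which the regex
# engine interprets as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # []
print(re.findall(raw, "a cat sat"))    # ['cat']
```

For the pattern in the question the prefix happens to make no difference, since \/ is not an escape sequence Python recognizes, but using r consistently avoids surprises like the one above.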

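Either way, you were on the right track: Unicode does not need any special handling for your example case. You just need to remove everything between the two slashes. I also restricted the match so that it cannot contain whitespace between the slashes; that way you will not capture everything between two single slashes that sit far apart in the document. The following works in the Python 2.7.12 REPL: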

>>> re.sub(r'\/[^/\s]+\/\s*', '', "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog")
'The Chihuahua (Spanish: chihuahue\\u00f1o) is the smallest breed of dog'
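If the text comes straight from a JSON response, the \uXXXX escapes are usually decoded by the JSON parser before the regex ever sees them. The following Python 3 sketch shows that flow under some assumptions: the payload is hand-written for illustration, and the key name "extract" is hypothetical rather than taken from the question.

```python
import json
import re

# Hypothetical payload for illustration; the key name "extract" is an assumption.
# The raw literal keeps the \uXXXX sequences as JSON escapes, as they would
# appear in the response body.
raw_json = r'{"extract": "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog"}'

# json.loads turns each \uXXXX escape into the real character.
text = json.loads(raw_json)["extract"]

# Same pattern as above; the slashes need no escaping in Python.
cleaned = re.sub(r'/[^/\s]+/\s*', '', text)
print(cleaned)  # The Chihuahua (Spanish: chihuahueño) is the smallest breed of dog
```

After decoding, the string contains ordinary Unicode characters, so there is no literal \u left to strip.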

Here is the regex broken down:

\/    # Match opening slash on the pronunciation expression
[^    # Begin a negated character set
  /     # Exclude the forward-slash /
  \s    # Also exclude all whitespace
]+    # Match one or more characters that are neither a slash nor whitespace
\/    # Match closing slash on the pronunciation expression
\s*   # Also match any whitespace that follows, so it is removed too
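The same breakdown can live inside the code itself by compiling the pattern with re.VERBOSE. This is a sketch in Python 3; the names PRONUNCIATION and strip_pronunciation are hypothetical helpers introduced here, not part of the original answer.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments in the pattern, except
# inside a character class, so the class is kept on a single line.
PRONUNCIATION = re.compile(r"""
    /          # opening slash of the pronunciation
    [^/\s]+    # one or more characters that are neither a slash nor whitespace
    /          # closing slash of the pronunciation
    \s*        # any whitespace that follows
""", re.VERBOSE)


def strip_pronunciation(text):
    """Remove /.../ runs that contain no whitespace (hypothetical helper)."""
    return PRONUNCIATION.sub("", text)


sample = ("The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ "
          "(Spanish: chihuahue\u00f1o) is the smallest breed of dog")
print(strip_pronunciation(sample))
# The Chihuahua (Spanish: chihuahueño) is the smallest breed of dog

print(strip_pronunciation("5 km/h and 3 m/s"))  # unchanged: no closing slash before whitespace
```

One limitation to keep in mind: any whitespace-free run between two slashes is removed, so ordinary text such as "input/output/error" would lose "/output/". A transcription that contains spaces would not be matched at all.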

Remove special Unicode characters with a regular expression?
