Question

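I know there are similar questions on StackOverflow. I tried adapting some of the approaches, but I couldn't get anything to work that fits my needs: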

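Given a Python string, I want to strip every non-alphanumeric character, but keep any special characters such as µ, æ, Å, Ç, ß ... Is this even possible? With regexes I tried variations of this:

re.sub(r'[^a-zA-Z0-9: ]', '', x) # x is my string to sanitize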

But I want more than that. An example of what I want:

Input: "A string, with characters µ, æ, Å, Ç, ß,... Some whitespace confusion ?"
Output: "A string with characters µ æ Å Ç ß Some whitespace confusion"

Is this possible without it getting complicated?

Answer 1

Use \w with the UNICODE flag set. This will also match underscores, so you may need to handle those separately.

See http://docs.python.org/library/re.html for details.

Edit: here is some actual code. It keeps Unicode letters, Unicode digits, and whitespace.

import re
x = u'$a_bßπ7: ^^@p'
pattern = re.compile(r'[^\w\s]', re.U)
re.sub(r'_', '', re.sub(pattern, '', x))

If you don't use re.U, the ß and π characters are stripped.

Sorry, I couldn't figure out a way to do this with a single regex. If you can, could you post a solution?
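One possible single-expression variant, offered only as a sketch: an alternation that matches either a character that is neither a word character nor whitespace, or an underscore. It assumes Python 3, where str patterns are Unicode-aware by default, so re.U is not needed. The names NON_ALNUM and strip_non_alnum are made up for illustration.

```python
import re

# Matches a character that is neither a word character nor whitespace,
# or a literal underscore (which \w would otherwise keep).
NON_ALNUM = re.compile(r'[^\w\s]|_')

def strip_non_alnum(text):
    """Hypothetical helper: remove every character matched by NON_ALNUM."""
    return NON_ALNUM.sub('', text)

print(strip_non_alnum('$a_bßπ7: ^^@p'))  # prints: abßπ7 p
```

Whitespace is left untouched here, so runs of spaces would still need to be collapsed separately if that matters.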

Answer 2

Eliminate the characters in the "Punctuation, Other" Unicode category.

# -*- coding: utf-8 -*-

import unicodedata

# This removes punctuation characters.
def strip_po(s):
  return ''.join(x for x in s if unicodedata.category(x) != 'Po')

# This reduces multiple whitespace characters into a single space.
def fix_space(s):
  return ' '.join(s.split())

s = u'A string, with characters µ, æ, Å, Ç, ß,... Some    whitespace  confusion  ?'
print(fix_space(strip_po(s)))
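Note that 'Po' is only one of several punctuation categories: hyphens are 'Pd', opening and closing brackets are 'Ps' and 'Pe', and so on. If those should go as well, a broader variant could drop every category whose code starts with 'P'. The following is a sketch under that assumption, written for Python 3; strip_punctuation is a hypothetical name, not part of the answer above.

```python
import unicodedata

def strip_punctuation(s):
    """Hypothetical helper: drop characters in any Unicode punctuation category (P*)."""
    return ''.join(ch for ch in s if not unicodedata.category(ch).startswith('P'))

def fix_space(s):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return ' '.join(s.split())

print(fix_space(strip_punctuation('well-known (µ, æ)!')))  # prints: wellknown µ æ
```

Symbols such as '$' belong to the 'S' categories and are kept by this sketch.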

Answer 3

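You will have to define more precisely what you mean by special characters. There are flags that group things such as whitespace, non-whitespace, digits, and so on, and they are locale-specific. See http://docs.python.org/library/re.html for details.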

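However, since this is a character-by-character operation, you may find it easier to simply specify every character explicitly, or, if the number of characters you want to exclude is small, to write an expression that excludes only those.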
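As a rough illustration of the "exclude only a small set" idea, the sketch below uses str.translate with a table that maps each unwanted character to None. It assumes Python 3; the EXCLUDED set and the remove_excluded name are placeholders chosen for this example, and the set would need to be adjusted to the actual data.

```python
# Characters to drop; an assumed example set, not an exhaustive list.
EXCLUDED = ',.?!:;'

# str.translate deletes any character whose code point maps to None.
DELETE_TABLE = {ord(ch): None for ch in EXCLUDED}

def remove_excluded(s):
    """Hypothetical helper: delete only the explicitly listed characters."""
    return s.translate(DELETE_TABLE)

print(remove_excluded('A string, with µ, æ... ok?'))  # prints: A string with µ æ ok
```

Everything not listed in EXCLUDED, including accented and other non-ASCII letters, passes through unchanged.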

Answer 4

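If you are happy with the Unicode Consortium's classification of what is a letter or a digit, here is a simple way to do it without regular expressions and without importing anything beyond the built-ins: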

filter(unicode.isalnum, u"A string, with characters µ, æ, Å, Ç, ß,... Some    whitespace  confusion  ?")

If you have a str rather than a unicode object, you need to decode it first.
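The call above is Python 2 code. In Python 3, filter returns an iterator and the text type is str, so a roughly equivalent sketch joins the result back into a string. Note that isalnum also rejects whitespace, so the second helper keeps whitespace and then collapses it. Both function names are made up for illustration.

```python
def keep_alnum(s):
    """Hypothetical helper: keep only characters Unicode classifies as letters or digits."""
    return ''.join(filter(str.isalnum, s))

def keep_alnum_and_spaces(s):
    """Hypothetical helper: keep letters, digits and whitespace, then collapse whitespace runs."""
    kept = ''.join(ch for ch in s if ch.isalnum() or ch.isspace())
    return ' '.join(kept.split())

text = 'A string, with characters µ, æ, Å, Ç, ß,... Some    whitespace  confusion  ?'
print(keep_alnum(text))
print(keep_alnum_and_spaces(text))
```

The second call produces the output asked for in the question: A string with characters µ æ Å Ç ß Some whitespace confusion.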

Remove non-alphanumeric characters from a string in Python, but keep special characters
