
Question

I have a string that looks something like this:

"stuff   .  // : /// more-stuff .. .. ...$%$% stuff -> DD"

I want to strip all punctuation, make everything uppercase, and collapse all whitespace so that it looks like this:

"STUFF MORE STUFF STUFF DD"

Is this possible with one regex, or do I need to combine two or more? This is what I have so far:

def normalize(string):
    import re

    string = string.upper()

    rex   = re.compile(r'\W')
    rex_s = re.compile(r'\s{2,}')

    result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
    result = rex.sub('', result) # this reduces all those spaces

    return result

The only thing that doesn't work is the whitespace collapsing. Any ideas?

Answer 1

Here's a single-step approach (the uppercasing actually uses a string method, which is much simpler):

rex = re.compile(r'\W+')
result = rex.sub(' ', strarg).upper()

where strarg is the string argument (please don't use names that shadow built-ins or standard library modules).

Answer 2

s = "$$$aa1bb2 cc-dd ee_ff ggg."
re.sub(r'\W+', ' ', s).upper()
# ' AA1BB2 CC DD EE_FF GGG '

Is _ punctuation?

re.sub(r'[_\W]+', ' ', s).upper()
# ' AA1BB2 CC DD EE FF GGG '

Don't want leading and trailing spaces?

re.sub(r'[_\W]+', ' ', s).strip().upper()
# 'AA1BB2 CC DD EE FF GGG'
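Putting those steps together, here is a minimal sketch applied to the string from the question. The function name normalize is borrowed from the question; treating _ as punctuation is an assumption about what the asker wants.

```python
import re

def normalize(text):
    """Sketch: replace punctuation/whitespace runs with one space, trim, uppercase."""
    # [_\W]+ matches one or more characters that are underscores or non-word characters
    return re.sub(r'[_\W]+', ' ', text).strip().upper()

print(normalize("stuff   .  // : /// more-stuff .. .. ...$%$% stuff -> DD"))
# STUFF MORE STUFF STUFF DD
```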

Answer 3

result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex.sub('', result) # this reduces all those spaces

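That's because you made a typo and forgot to use rex_s for the second call. Also, you need to substitute a single space back in, otherwise every multi-space gap ends up as no gap at all instead of a single-space gap.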

result = rex.sub(' ', string) # this produces a string with tons of whitespace padding
result = rex_s.sub(' ', result) # this reduces all those spaces
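For reference, here is the question's function with only that fix applied, as a self-contained sketch. The parameter is renamed to text (my choice, to avoid shadowing the string module); everything else follows the question.

```python
import re

def normalize(text):
    text = text.upper()

    rex = re.compile(r'\W')        # any single non-word character
    rex_s = re.compile(r'\s{2,}')  # runs of two or more whitespace characters

    result = rex.sub(' ', text)      # punctuation becomes padding spaces
    result = rex_s.sub(' ', result)  # each run of padding shrinks to one space

    return result

print(normalize("stuff   .  // : /// more-stuff .. .. ...$%$% stuff -> DD"))
# STUFF MORE STUFF STUFF DD
```

Note that a string starting or ending with punctuation keeps one leading or trailing space with this version; add .strip() if that matters.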

Answer 4

Do you have to use regular expressions? Do you feel you have to do it in one line?

>>> import string
>>> s = "stuff   .  // : /// more-stuff .. .. ...$%$% stuff -> DD"
>>> s2 = ''.join(c for c in s if c in string.ascii_letters + ' ')
>>> ' '.join(s2.split())
'stuff morestuff stuff DD'
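The output above differs from what the question asked for: punctuation is deleted rather than turned into a gap (so more-stuff becomes morestuff), and nothing is uppercased. A regex-free sketch that produces the requested result could look like this; the helper name normalize_no_regex is made up for illustration.

```python
def normalize_no_regex(text):
    # Turn every character that is not a letter or digit into a space
    spaced = ''.join(c if c.isalnum() else ' ' for c in text)
    # split() with no arguments drops all whitespace runs; join puts one space back
    return ' '.join(spaced.upper().split())

s = "stuff   .  // : /// more-stuff .. .. ...$%$% stuff -> DD"
print(normalize_no_regex(s))
# STUFF MORE STUFF STUFF DD
```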

Answer 5

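Works in Python 3. This one preserves the kind of whitespace character being collapsed, so a tab and a space next to each other will not be collapsed into a single character.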

def collapse_whitespace_characters(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if not cur_char.isspace() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret

This one collapses each run of whitespace into the first whitespace character it sees:

def collapse_whitespace(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if not cur_char.isspace() or \
                    (cur_char.isspace() and not prev_char.isspace()):
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret

>>> collapse_whitespace_characters('we like    spaces  and\t\t TABS\tAND WHATEVER\xa0\xa0IS')
'we like spaces and\t TABS\tAND WHATEVER\xa0IS'

>>> collapse_whitespace('we like    spaces  and\t\t TABS\tAND WHATEVER\xa0\xa0IS')
'we like spaces and\tTABS\tAND WHATEVER\xa0IS'
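Both loops can probably also be written with a backreference in a regex. The following is a sketch, not part of the original answer; the function names are made up, and it assumes Python 3 str input, where \s and str.isspace() agree on characters such as \xa0.

```python
import re

def collapse_same_whitespace(text):
    # (\s)\1+ : a whitespace character followed by repeats of that same character
    return re.sub(r'(\s)\1+', r'\1', text)

def collapse_any_whitespace(text):
    # (\s)\s* : a whitespace character followed by any further whitespace;
    # only the first character of the run is kept
    return re.sub(r'(\s)\s*', r'\1', text)

sample = 'we like    spaces  and\t\t TABS\tAND WHATEVER\xa0\xa0IS'
print(repr(collapse_same_whitespace(sample)))
# 'we like spaces and\t TABS\tAND WHATEVER\xa0IS'
print(repr(collapse_any_whitespace(sample)))
# 'we like spaces and\tTABS\tAND WHATEVER\xa0IS'
```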

For punctuation:

def collapse_punctuation(raw_text):
    ret = ''
    if len(raw_text) > 1:
        prev_char = raw_text[0]
        ret += prev_char
        for cur_char in raw_text[1:]:
            if cur_char.isalnum() or cur_char != prev_char:
                ret += cur_char
            prev_char = cur_char
    else:
        ret = raw_text
    return ret
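A rough regex counterpart, as a sketch only: it treats anything matched by [\W_] as collapsible, which is close to "not isalnum()" but may differ for some Unicode characters. The function name is hypothetical. Like the loop above, it also collapses repeated spaces, since a space is not alphanumeric.

```python
import re

def collapse_repeated_punctuation(text):
    # ([\W_])\1+ : a non-alphanumeric character followed by repeats of itself
    return re.sub(r'([\W_])\1+', r'\1', text)

print(collapse_repeated_punctuation("wait... what?!!  ok--fine"))
# wait. what?! ok-fine
```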

To actually answer the question:

orig = 'stuff   .  // : /// more-stuff .. .. ...$%$% stuff -> DD'
collapse_whitespace(''.join([(c.upper() if c.isalnum() else ' ') for c in orig]))

As mentioned above, the regex would be something like

re.sub(r'\W+', ' ', orig).upper()

Answer 6

Repeated whitespace can be replaced using a regular expression. Whitespace is matched by \s, and \s+ means: one or more.

import re
rex = re.compile(r'\s+')
test = "     x  y z           z"
res = rex.sub(' ', test)
print(f">{res}<")

> x y z z<

Note that this also affects/includes carriage returns etc.
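To illustrate that caveat, here is a small sketch comparing \s+ with a narrower character class. The sample text and variable names are made up, and [ \t]+ is just one possible way to leave line breaks alone.

```python
import re

text = "line one\r\n\r\nline   two\t\tend"

# \s+ swallows carriage returns and newlines along with spaces and tabs
all_collapsed = re.sub(r'\s+', ' ', text)
print(repr(all_collapsed))
# 'line one line two end'

# [ \t]+ only collapses spaces and tabs, so line breaks survive
spaces_only = re.sub(r'[ \t]+', ' ', text)
print(repr(spaces_only))
# 'line one\r\n\r\nline two end'
```

If regular expressions are not required, ' '.join(text.split()) gives the same result as the \s+ version and also trims leading and trailing whitespace.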

Collapsing whitespace in a string

6 answers:
