Question

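There are several questions about using regular expressions to remove non-alphanumeric characters from a string. What I want to do is delete every character, including letters, that comes after the first character that is not a letter or a single space (digits and double spaces count as such characters).

For example: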

My string is #not very beautiful

should become

My string is

or

Are you 9 years old?

should become

Are you

and

this is the last  example

should become

this is the last

How can I do this?

Answer 1

How about splitting on "[^A-Za-z ]|  " and taking the first element? You can strip any leftover whitespace afterwards:

import re
re.split("[^A-Za-z ]|  ", "My string is #not very beautiful")[0].strip()
# 'My string is'

re.split("[^A-Za-z ]|  ", "this is the last  example")[0].strip()
# 'this is the last'

re.split("[^A-Za-z ]|  ", "Are you 9 years old?")[0].strip()
# 'Are you'

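The pattern "[^A-Za-z ]|  " contains two alternatives: the first matches a single character that is neither a letter nor a space; the second matches a double space. Split on either of them, and the first element of the result is what you are looking for.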
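As a minimal sketch, the same split could be wrapped in a small helper with the pattern compiled once. The names TRUNCATE_PATTERN and truncate_at_first_invalid are made up for this illustration and are not part of the answer above.

```python
import re

# Hypothetical names, chosen for this sketch only.
# Alternative 1: any single character that is not an ASCII letter or a space.
# Alternative 2: two consecutive spaces.
TRUNCATE_PATTERN = re.compile(r"[^A-Za-z ]|  ")

def truncate_at_first_invalid(text):
    """Return the part of text before the first invalid character or double space."""
    # maxsplit=1: only the piece before the first match is needed
    return TRUNCATE_PATTERN.split(text, maxsplit=1)[0].strip()

for example in ("My string is #not very beautiful",
                "Are you 9 years old?",
                "this is the last  example"):
    print(repr(truncate_at_first_invalid(example)))
```

Passing maxsplit=1 is only a small optimisation: the first element is the same with or without it.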

Answer 2

Create a whitelist and stop as soon as you see something that is not in it:

import itertools
import string

def rstrip(s, whitelist=None):
    if whitelist is None:
        whitelist = set(string.ascii_letters + ' ')  # default whitelist: all letters A-Z and a-z, plus the space
    # split on the first double space and keep the part before it (this works even if the string has no double space)
    # use `itertools.takewhile` to keep characters for as long as they are in the whitelist
    # use `join` to join them into one single string

    return ''.join(itertools.takewhile(whitelist.__contains__, s.split('  ', 1)[0]))
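For illustration, this is how such a function might be called, including a custom whitelist that also allows digits. The function is restated under the hypothetical name truncate_to_whitelist so the snippet runs on its own.

```python
import itertools
import string

def truncate_to_whitelist(s, whitelist=None):
    # Hypothetical name; same approach as the answer above.
    if whitelist is None:
        whitelist = set(string.ascii_letters + " ")
    head = s.split("  ", 1)[0]  # text before the first double space
    return "".join(itertools.takewhile(whitelist.__contains__, head))

# Default whitelist: letters and single spaces (the trailing space is kept).
print(repr(truncate_to_whitelist("Are you 9 years old?")))  # 'Are you '

# Custom whitelist that also allows digits: only the '?' stops the scan.
allowed = set(string.ascii_letters + string.digits + " ")
print(repr(truncate_to_whitelist("Are you 9 years old?", allowed)))  # 'Are you 9 years old'
```

Note that the result can end with a space; call .rstrip() on it if that is not wanted.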

Answer 3

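import re
str1 = "this is the last  example"
# one or more letters, each optionally preceded by a single whitespace character
regex = re.compile(r"(([a-zA-Z]|(\s[a-zA-Z]))+)")
capture = regex.match(str1)
res = capture.group(1) if capture else ""  # 'this is the last'; "" if the string does not start with a letter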

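I also tested it with your other examples and it seems to give the correct results. Note that this does not keep trailing spaces, which matches what your examples show, even if that is not what you intended.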

Answer 4

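The obligatory generator expression:

def truncate_nonalpha_space(s):
    head = s.split("  ", 1)[0]  # keep only the text before the first double space
    cut = next((i for i, ch in enumerate(head) if not ch.isalpha() and ch != " "), len(head))
    return head[:cut].rstrip()

Steps:

Take the left-hand part of s split on "  ", which disposes of any double space before the expression runs
Enumerate that part to get the index of each character in it
Build a generator expression that yields the indices of characters that are neither letters (checked with .isalpha()) nor equal to " "
Use the first of those indices to slice the part, falling back to its full length if there is none, then strip trailing whitespace with .rstrip()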
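The steps above can also be written as an explicit loop, which may be easier to follow than the one-liner. This is only a sketch; the name truncate_nonalpha_space_loop is hypothetical.

```python
def truncate_nonalpha_space_loop(s):
    # Hypothetical name; explicit-loop version of the steps above.
    head = s.split("  ", 1)[0]  # text before the first double space
    for index, char in enumerate(head):
        # str.isalpha() is also True for non-ASCII letters such as 'é'
        if not char.isalpha() and char != " ":
            return head[:index].rstrip()  # cut at the first invalid character
    return head.rstrip()  # no invalid character found

print(repr(truncate_nonalpha_space_loop("My string is #not very beautiful")))
print(repr(truncate_nonalpha_space_loop("this is the last  example")))
```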

Answer 5

^.+?(?=[^A-Za-z ]|$|\s{2})

You can use this pattern to get the output. Use re.findall to collect the result.

See the demo.

https://regex101.com/r/INzotJ/1
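A minimal sketch of how the pattern might be used from Python, assuming re.findall as suggested. The function name truncate_with_lookahead is hypothetical, and the final .rstrip() is an addition to match the expected outputs in the question.

```python
import re

# Pattern from the answer above: lazily match from the start of the string up to
# (but not including) the first disallowed character, a double whitespace, or the end.
LOOKAHEAD_PATTERN = re.compile(r"^.+?(?=[^A-Za-z ]|$|\s{2})")

def truncate_with_lookahead(text):
    # Hypothetical helper name. The ^ anchor means findall returns at most one match.
    found = LOOKAHEAD_PATTERN.findall(text)
    return found[0].rstrip() if found else ""

for example in ("My string is #not very beautiful",
                "Are you 9 years old?",
                "this is the last  example"):
    print(repr(truncate_with_lookahead(example)))
```

One caveat: .+? always consumes at least one character, so the first character of the string is never checked against the allowed set.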

Answer 6

Hacky, but using yield:

import string

li_test = [
    ("My string is #not very beautiful","My string is"),
    ("Are you 9 years old?","Are you "),
    ("this is the last  example","this is the last "),
]

tolerated = string.ascii_letters

def rstrip_(s_in):
    last = None
    for char in s_in:
        if char in tolerated:
            last = char
            yield char
        elif char == ' ':
            if last == ' ':
                # return ends the generator (raising StopIteration here is a RuntimeError since Python 3.7)
                return
            last = char
            yield char
        else:
            return

for input_, exp in li_test:
    got = "".join(rstrip_(input_))
    msg = ":%s:<>:%s:" % (exp, got)
    print (":%s:=>:%s:" % (input_, got))
    #cheating a bit because I dunno if the last space is wanted.
    assert exp.rstrip() == got.rstrip(), msg

Output:

 :My string is #not very beautiful:=>:My string is :
 :Are you 9 years old?:=>:Are you :
 :this is the last  example:=>:this is the last :

And yes, I should wrap the whole thing in a second function and join the characters there...
