Question

I have a DataFrame column whose values mix letters and digits, for example 'XYZABC/123441 s sdx' and similar.

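I need to remove all punctuation, remove single-letter words, replace double spaces with a single space, trim the string, and replace digits with "NUMB#", where '#' is the length of the digit run. So '123441' here would be replaced with "NUMB6", and so on.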

My current code is:

for x in df["colname"]:
    x = re.sub(r"[^\w\s]", " ", str(x))      # Replace punctuation with a space
    x = re.sub(r"\d+", "NUMB", str(x))       # Replace each run of digits with 'NUMB'
    x = re.sub(r"\b[a-zA-Z]\b", "", str(x))  # Remove single-letter words
    x = re.sub(r"\s+", " ", str(x))          # Collapse whitespace runs into a single space
    x = x.strip().upper()                    # Trim the string and uppercase it

I did find a question on this site about replacing a substring with its length:

re.sub(r'\b([A-Z][a-z]*)\b', lambda m: str(len(m.group(1))), s)

All I need to do is replace "([A-Z][a-z]*)" with '\d'. However, I don't know how to join the two together; the '.append' function doesn't work. This is probably basic, but I'm new to Python, so I don't know how to do this.

Answer 1

You can use apply, like this:

import re

def repl(x):
    # Replace each run of digits with "NUMB" followed by the run's length
    return re.sub(r'\d+', lambda m: "NUMB{}".format(len(m.group())), x)

df['colname'] = df['colname'].apply(repl)
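Since apply calls the function once per value, repl can be tried on plain strings first. The snippet below is a minimal sketch; the sample strings other than the one from the question are made up for illustration.

```python
import re

def repl(x):
    # Replace each run of digits with "NUMB" plus the number of digits in the run.
    return re.sub(r"\d+", lambda m: "NUMB{}".format(len(m.group())), x)

# Hypothetical sample values standing in for the DataFrame column.
samples = ["XYZABC/123441 s sdx", "ab 12 cd 3456", "no digits here"]
results = [repl(s) for s in samples]
print(results)
```

Each digit run is measured separately, so a string with several numbers gets one NUMB token per run.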

Or, keeping the same logic as in your code, replace x = re.sub(r"\d+", "NUMB", str(x)) with

x = re.sub(r'\d+', lambda m: "NUMB{}".format(len(m.group())), x)

re.sub(r'\d+', lambda m: "NUMB{}".format(len(m.group())), x) finds every non-overlapping run of digits and replaces it with NUMB followed by the length of that run.
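Putting the pieces together, the cleaning steps from the question can be wrapped in one function. This is only a sketch: the name clean_text is made up here, and the steps run in the same order as in the question's loop.

```python
import re

def clean_text(value):
    # clean_text is a hypothetical helper name, not part of pandas or re.
    text = str(value)
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\d+",                      # digit run -> NUMB + run length
                  lambda m: "NUMB{}".format(len(m.group())), text)
    text = re.sub(r"\b[a-zA-Z]\b", "", text)   # drop single-letter words
    text = re.sub(r"\s+", " ", text)           # collapse whitespace runs
    return text.strip().upper()                # trim and uppercase

cleaned = clean_text("XYZABC/123441 s sdx")
print(cleaned)  # XYZABC NUMB6 SDX
```

With a DataFrame, df['colname'] = df['colname'].apply(clean_text) stores the result back in the column. The for loop in the question only rebinds the local name x on each iteration, so on its own it leaves the column unchanged.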

Replace digits in a string with a string that has the length of the digit run appended
