Simplest way to clean a string

Time: 2014-01-29 21:06:50

Tags: python string if-statement for-loop strip

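What is a simple way to clean a string entered by a user? This is the code I currently rely on to clean up the mess. A simpler, smarter version would be great.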

invalid = ['#','@','$','$','%','^','&','*','(',')','-','+','!',' ']
for c in invalid: 
    if len(line)>0: line=line.replace(c,'')

P.S. How can I put this function (with the nested if) on a single line?

7 answers:

Answer 0 (score: 5)

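import re
# The hyphen is escaped so ")-+" is not read as a range, the space from the
# question's list is included, and the result is assigned back to line.
line = re.sub(r'[#@$%^&*()\-+! ]', '', line)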

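re is the regular expression module. Square brackets mean "match any one of the characters inside the brackets". So the call says: "find any of the bracketed characters in line and replace it with nothing ('')".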
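One detail worth checking in a character class like this: a hyphen placed between two characters is read as a range, so `)-+` means "every character from `)` to `+`" rather than three literal characters. A minimal sketch, assuming Python 3 and a made-up sample string, that shows the difference and one way around it (escaping the hyphen):

```python
import re

sample = "a-b+c*d(e)f"  # hypothetical input, for illustration only

# Unescaped: ")-+" is a range covering ')', '*' and '+', so '-' survives.
ranged = re.sub(r'[#@$%^&*()-+!]', '', sample)

# Escaped: the hyphen is a literal character and is removed as well.
literal = re.sub(r'[#@$%^&*()\-+!]', '', sample)

print(ranged)   # a-bcdef
print(literal)  # abcdef
```

Placing the hyphen first or last inside the brackets would have the same effect as escaping it.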

Answer 1 (score: 5)

The fastest way is to use str.translate (shown here in its Python 2 form):

>>> invalid = ['#','@','$','$','%','^','&','*','(',')','-','+','!',' ']
>>> s = '@#$%^&*fdsfs#$%^&*FGHGJ'
>>> s.translate(None, ''.join(invalid))
'fdsfsFGHGJ'

Timing comparison

>>> s = '@#$%^&*fdsfs#$%^&*FGHGJ'*100

>>> %timeit re.sub('[#@$%^&*()-+!]', '', s)
1000 loops, best of 3: 766 µs per loop

>>> %timeit re.sub('[#@$%^&*()-+!]+', '', s)
1000 loops, best of 3: 215 µs per loop

>>> %timeit "".join(c for c in s if c not in invalid)
100 loops, best of 3: 1.29 ms per loop

>>> %timeit re.sub(invalid_re, '', s)
1000 loops, best of 3: 718 µs per loop

>>> %timeit s.translate(None, ''.join(invalid))         #Winner
10000 loops, best of 3: 17 µs per loop

On Python 3 you need to do something like this:

>>> trans_tab = {ord(x):None for x in invalid}
>>> s.translate(trans_tab)
'fdsfsFGHGJ'
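On Python 3, `str.maketrans` can also build the deletion table: its third argument is a string of characters that get mapped to `None`. A short sketch along the same lines, reusing the `invalid` list and sample string from above:

```python
invalid = ['#', '@', '$', '$', '%', '^', '&', '*', '(', ')', '-', '+', '!', ' ']
s = '@#$%^&*fdsfs#$%^&*FGHGJ'

# The third argument of str.maketrans lists the characters to delete.
delete_table = str.maketrans('', '', ''.join(invalid))
cleaned = s.translate(delete_table)

print(cleaned)  # fdsfsFGHGJ
```

Building the table once and reusing it for every input line avoids rebuilding it for each call.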

Answer 2 (score: 4)

You can do it like this:

from string import punctuation # !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

line = "".join(c for c in line if c not in punctuation)

For example:

'hello, I @m pleased to meet you! How *about (you) try something > new?'

becomes

'hello I m pleased to meet you How about you try something  new'

Answer 3 (score: 1)

This is one case where regular expressions are actually useful.

>>> invalid = ['#','@','$','$','%','^','&','*','(',')','-','+','!',' ']
>>> import re
>>> invalid_re = '|'.join(map(re.escape, invalid))
>>> re.sub(invalid_re, '', 'foo * bar')
'foobar'

Answer 4 (score: 1)

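Here is a snippet I use in my own code. You basically use a regular expression to specify the allowed characters, match runs of those characters, and then join the matches together.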

import re

def clean(string_to_clean, valid='ACDEFGHIKLMNPQRSTVWY'):
    """Remove unwanted characters from string.

    Args:
        string_to_clean: (str) The string from which to remove
            unwanted characters.

        valid: (str) The characters that are valid and should be
            included in the returned sequence. Default character
            set is: 'ACDEFGHIKLMNPQRSTVWY'.

    Returns: (str) A sequence without the invalid characters, as a string.

    """
    valid_string = r'([{}]+)'.format(valid)
    valid_regex = re.compile(valid_string, re.IGNORECASE)

    # Create string of matching characters, concatenate to string
    # with join().
    return (''.join(valid_regex.findall(string_to_clean)))
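A quick usage sketch for this whitelist approach. The function is restated in compact form so the example runs on its own, and the sample inputs are made up. Note that `re.escape` is an addition that is not in the snippet above; it is assumed here so that characters such as `-` or `]` in `valid` are treated literally rather than as regex syntax.

```python
import re

def clean(string_to_clean, valid='ACDEFGHIKLMNPQRSTVWY'):
    """Keep only the characters listed in `valid` (case-insensitive)."""
    # re.escape is an assumption added for this sketch: it makes regex
    # metacharacters in `valid` literal inside the character class.
    valid_regex = re.compile('[{}]+'.format(re.escape(valid)), re.IGNORECASE)
    return ''.join(valid_regex.findall(string_to_clean))

default_result = clean('ACD-xyz 123 EFG')
custom_result = clean('a-b_c', valid='abc-')

print(default_result)  # ACDyEFG
print(custom_result)   # a-bc
```

Because of `re.IGNORECASE`, the lowercase `y` is kept in the first call even though the default set lists only uppercase letters.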

Answer 5 (score: 1)

Using a simple comprehension (strictly speaking, a generator expression):

>>> invalid = ['#','@','$','$','%','^','&','*','(',')','-','+','!',' ']
>>> x = 'foo * bar'
>>> "".join(i for i in x if i not in invalid)
'foobar'

Using a comprehension with string.punctuation plus a space:

>>> import string
>>> x = 'foo * bar'
>>> "".join(i for i in x if i not in string.punctuation)
'foo  bar'
>>> "".join(i for i in x if i not in string.punctuation+" ")
'foobar'
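Appending `" "` only covers the plain space character. If tabs and newlines should go as well, `string.whitespace` can be added instead. A small sketch, assuming that is the desired behavior; the sample string is made up:

```python
import string

# Build the set once; membership tests on a set are fast.
unwanted = set(string.punctuation + string.whitespace)

x = 'foo *\tbar\n'  # hypothetical input containing a tab and a newline
cleaned = "".join(ch for ch in x if ch not in unwanted)

print(cleaned)  # foobar
```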

Using str.translate (Python 2 form):

>>> invalid = ['#','@','$','$','%','^','&','*','(',')','-','+','!',' ']
>>> x = 'foo * bar'
>>> x.translate(None,"".join(invalid))
'foobar'

Using re.sub:

>>> import re
>>> invalid = ['#','@','$','$','%','^','&','*','(',')','-','+','!',' ']
>>> x = 'foo * bar'
>>> y = "[" + re.escape("".join(invalid)) + "]"  # escape so '-' and '^' stay literal
>>> re.sub(y,'',x)
'foobar'
>>> re.sub(y+'+','',x)
'foobar'

Answer 6 (score: 1)

This works:

invalid = '#@$%^_ '
line = "#master_Of^Puppets#@$%Yeah"
line = "".join([l for l in line if l not in invalid])
# line is now 'masterOfPuppetsYeah'