How do I remove escape sequences from a list that contains unicode and non-ASCII characters?

Asked: 2016-09-08 15:26:13

Tags: python-2.7 unicode ascii

I am pulling some licensure data and putting it into lists.

rank = ['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']
cert = ['\r\n\t\t', 'KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']

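I want to remove the escape sequences and the non-ASCII unicode characters from each list and rejoin the date parts, so that my lists end up looking like this: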

rank = ['RANK2', 'Rank II', '07-01-2016', '06-30-2021']
cert = ['KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07-01-2016', '06-30-2021']

I have looked at other questions on removing escape sequences from lists, removing unicode, removing non-ASCII characters, and some others, but I could not adapt them to my situation.

A few attempts came close, but no cigar:

[word for word in cert if word.isalnum()]
>>> ['KEL', '07', '01', '2016', '06', '30', '2021']

def recursive_map(lst, fn):
    return [recursive_map(x, fn) if isinstance(x, list) else fn(x) for x in lst]
recursive_map(rank, lambda x: x.encode("ascii", "ignore"))
>>>['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', '', '06', '-', '30', '-', '2021', '', '\r\n\t']    

At this point I am stuck. Does anyone have any ideas?

1 Answer:

Answer 0 (score: 1)

Here is something quick:

rank = ['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']
cert = ['\r\n\t\t', 'KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']

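def clean(L):
    '''Removes control and non-ASCII characters, drops empty strings,
    and rejoins the hyphenated dates.
    '''
    # Scrub each item once and keep only the non-empty results.
    cleaned = [s for s in map(scrubbed, L) if s]
    # Join on a character outside the ASCII range (it cannot appear in the
    # scrubbed strings), merge the hyphenated dates, then split again.
    return '\xa0'.join(cleaned).replace('\xa0-\xa0','-').split('\xa0')

def scrubbed(s):
    '''Removes control and non-ASCII characters.
    '''
    # 32..126 is printable ASCII; 127 (DEL) is a control character.
    return ''.join([n for n in s if 32 <= ord(n) < 127])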

print(clean(rank))
print(clean(cert))

Output:

['RANK2', 'Rank II', '07-01-2016', '06-30-2021']
['KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07-01-2016', '06-30-2021']
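
If the join/replace/split trick feels too clever, the same result can probably be reached with an explicit loop. The following is only a sketch under the assumption that a lone '-' token always belongs between the tokens on either side of it, as in the sample data; the names scrub and clean_tokens are hypothetical and not part of the code above.

```python
def scrub(s):
    # Keep only printable ASCII (space through tilde). This drops control
    # characters such as \r, \n, \t and non-ASCII characters such as \xa0.
    return ''.join(ch for ch in s if 32 <= ord(ch) < 127)

def clean_tokens(items):
    # Hypothetical helper: scrub every item, skip the empty results, and
    # glue 'a', '-', 'b' sequences back together as 'a-b'.
    result = []
    glue = False  # True right after a hyphen was appended to result[-1]
    for token in (scrub(x) for x in items):
        if not token:
            continue
        if token == '-' and result:
            result[-1] += '-'
            glue = True
        elif glue:
            result[-1] += token
            glue = False
        else:
            result.append(token)
    return result

rank = ['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016',
        u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']

print(clean_tokens(rank))
```

For the rank list in the question this should give the same four items as the desired output. Because no sentinel character is joined in, it also does not matter whether the input strings themselves contain '\xa0'.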