Python Regex用空格打印(仅限字母)单词并排除非ascii字符

时间:2016-12-17 23:06:09

标签: python regex

我的python函数定义如下:

def name_extractor(dirty_name):
    print Name
    clean_name = re.sub('\W'," ", dirty_name)
    print clean_name

脏名称的样本包含:

(10) Johny Doe
Eric E. Shelby
(1) Chris Melton - ŗ≤ēŗ≤Ņŗ≤įŗ≤Ņŗ≤ēŗ≥ć ŗ≤ēŗ≥Äŗ≤įŗ≥ćŗ≤§ŗ≤Ņ
Jonas Alexander Bay
Christopher Rockstar - An awesome guy
Jones Collier

我想要输出只打印:

Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier

如何调整正则表达式以仅按原样打印名称并排除&#34之后的所有内容(随机字符或正常的ascii字符) - "?

2 个答案:

答案 0 :(得分:2)

您不需要正则表达式。拆分' - '上的每一行,然后过滤掉你不想要的字符,剥去额外的空格:

>>> l = '''(10) Johny Doe
... Eric E. Shelby
... (1) Chris Melton - ŗ≤ēŗ≤Ņŗ≤įŗ≤Ņŗ≤ēŗ≥ć ŗ≤ēŗ≥Äŗ≤įŗ≥ćŗ≤§ŗ≤Ņ
... Jonas Alexander Bay
... Christopher Rockstar - An awesome guy
... Jones Collier'''.splitlines()
>>> for line in l:
...     print(''.join(c for c in line.split(' - ')[0] if c.isalpha() or c in ' .').strip())
...
Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier

答案 1 :(得分:0)

要排除所有非ascii字符以及在连字符-之后的所有其他字符 - 用空字符串""替换它们就足够了。
使用特定正则表达式模式的简短解决方案:

dirty_name = '''
(10) Johny Doe
Eric E. Shelby
(1) Chris Melton - ŗ≤ēŗ≤Ņŗ≤įŗ≤Ņŗ≤ēŗ≥ć ŗ≤ēŗ≥Äŗ≤įŗ≥ćŗ≤§ŗ≤Ņ
Jonas Alexander Bay
Christopher Rockstar - An awesome guy
Jones Collier'''

clean_name = '\n'.join(l.lstrip() for l in re.sub(r'[^\x00-\x7f]|[\d()]| - .+\b(?=\n)', "", dirty_name).split('\n'))
print(clean_name)

输出:

Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier

编辑: 删除左前方空格导致@ TigerhawkT3太"空间敏感" (在他自己的宗教中) )

P.S。 \x00-\x7f ASCII 字符范围