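My data is in a list of lists, and I have already tokenized it. The tokens contain non-alphabetic characters (e.g. ?, ., !).
I want to remove the non-alphabetic characters (e.g. ?, ., !) from the list below.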
[['comfortable',
'questions?',
'menu',
'items!',
'time',
'lived',
'there,',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies.'],
['.',
'sure',
'giving',
'wheat',
'fiction',
'free',
'foodthis',
'place',
'clean.']]
The output should look like this:
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there,',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure',
'giving',
'wheat',
'fiction',
'free',
'foodthis',
'place',
'clean']]
I tried the code below (it does not work):
import re
tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts]
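Any suggestions?
Answer 0: (score: 3)
Your regex approach does not work because what you have is a list of lists, so you are passing each inner list to re.sub, which expects a string.
You need to iterate over the inner lists as well and call re.sub on each word. Example: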
>>> tokens = [[y for y in (re.sub(r'[^A-Za-z0-9]+', '', x) for x in sublst) if y] for sublst in texts]
>>> pprint.pprint(tokens)
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
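To make the failure mode concrete, here is a minimal sketch, assuming Python 3. The `texts` list is a shortened, hypothetical stand-in for the question's data, and the names `pattern`, `failed` and `tokens` are only illustrative. It shows that handing an inner list to re.sub raises a TypeError, while iterating over each inner list cleans the words.

```python
import re

# Shortened stand-in for the question's data (an assumption, not the full list).
texts = [['questions?', 'items!', 'there,'], ['.', 'clean.']]

pattern = re.compile(r'[^A-Za-z0-9]+')

# Iterating only over the outer list hands each inner list to re.sub,
# which expects a string and raises TypeError.
try:
    [pattern.sub('', x) for x in texts]
    failed = False
except TypeError:
    failed = True

# Iterating over the inner lists as well cleans every word; the trailing
# "if cleaned" drops tokens that become empty (such as '.').
tokens = [[cleaned for cleaned in (pattern.sub('', word) for word in sublist) if cleaned]
          for sublist in texts]

print(failed)
print(tokens)
```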
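Answer 1: (score: 1)
You are almost there. Your tokens are a list of lists, but your list comprehension only iterates over the outer list, so each element it sees is an inner list rather than a word.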
from pprint import pprint
import re
tokens = [['comfortable',
'questions?',
'menu',
'items!',
'time',
'lived',
'there,',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies.'],
['.',
'sure',
'giving',
'wheat',
'fiction',
'free',
'foodthis',
'place',
'clean.']]
out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in
tokens]
pprint(out)
which produces
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure',
'giving',
'wheat',
'fiction',
'free',
'foodthis',
'place',
'clean']]
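Answer 2: (score: 1)
If the punctuation is always at the end, you can str.rstrip it:
from string import punctuation
for sub in l:
    # Slice assignment replaces each inner list's contents in place.
    sub[:] = (word for word in (w.rstrip(punctuation) for w in sub)
              if word)
Output: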
from pprint import pprint as pp
pp(l)
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
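One caveat is worth illustrating: str.rstrip only trims characters from the right-hand end. The sketch below, assuming Python 3 and a made-up `words` list, compares it with str.strip, which trims both ends; neither touches punctuation in the middle of a word.

```python
from string import punctuation

# Hypothetical sample words chosen to show the difference.
words = ['clean.', '"quoted"', "don't", '...']

# rstrip removes trailing punctuation only.
right_only = [w.rstrip(punctuation) for w in words]

# strip removes punctuation from both ends; inner characters are kept.
both_ends = [w.strip(punctuation) for w in words]

print(right_only)
print(both_ends)
```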
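Or use str.translate to remove punctuation from anywhere in the word:
from string import punctuation
for sub in l:
    # Python 2 signature: translate(table, deletechars); None means no mapping table.
    sub[:] = (word for word in (w.translate(None, punctuation) for w in sub)
              if word)
Output: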
[['comfortable',
'questions',
'menu',
'items',
'time',
'lived',
'there',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies'],
['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
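The `w.translate(None, punctuation)` call is the Python 2 form of str.translate. As a hedged sketch of a Python 3 equivalent, a deletion table can be built with str.maketrans; the list `l` here is a shortened, hypothetical stand-in for the data, and `table` is an illustrative name.

```python
from string import punctuation

# Shortened stand-in for the list `l` (an assumption).
l = [['questions?', 'it!ems', 'there,'], ['.', 'clean.']]

# In Python 3, the third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', punctuation)

for sub in l:
    # Slice assignment mutates each inner list in place; empty results are dropped.
    sub[:] = [word for word in (w.translate(table) for w in sub) if word]

print(l)
```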
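If you want a new list (note that this comprehension flattens the sublists into a single list):
cleaned = [word for sub in l
           for word in (w.translate(None, punctuation)
                        for w in sub) if word]
translate is more efficient than a regex, and if the punctuation is always at the end, rstrip is more efficient still: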
In [2]: %%timeit
....: r = re.compile(r'[^A-Za-z0-9]+')
....: [[y for y in (r.sub('', x) for x in sublst) if y] for sublst in l]
....:
10000 loops, best of 3: 37.3 µs per loop
In [3]: %%timeit
....: out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in
....: l]
....:
10000 loops, best of 3: 58.3 µs per loop
In [4]: from string import punctuation
In [5]: %%timeit
...: cleaned = [word for sub in l
...: for word in (w.translate(None, punctuation)
...: for w in sub) if word]
...:
100000 loops, best of 3: 11.6 µs per loop
In [6]: %%timeit
...: cleaned = [word for sub in l
...: for word in (w.rstrip(punctuation)
...: for w in sub) if word]
...:
100000 loops, best of 3: 6.81 µs per loop
In [7]: %%timeit
result = []
for d in l:
for r in string.punctuation:
d = [x.replace(r, '') for x in d]
result.append([x for x in d if d])
....:
10000 loops, best of 3: 160 µs per loop
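For anyone who wants to rerun a comparison like this outside IPython, the following is a rough sketch using the standard timeit module, assuming Python 3. The `data` list is a shortened stand-in, and the function names are hypothetical. Absolute timings will differ by machine and Python version, so no figures are given here.

```python
import re
import timeit
from string import punctuation

# Shortened stand-in for the list being cleaned (an assumption).
data = [['comfortable', 'questions?', 'menu', 'items!', 'there,', 'allergies.'],
        ['.', 'sure', 'giving', 'wheat', 'clean.']]

regex = re.compile(r'[^A-Za-z0-9]+')
table = str.maketrans('', '', punctuation)  # Python 3 deletion table

def with_regex(lists):
    return [[y for y in (regex.sub('', x) for x in sub) if y] for sub in lists]

def with_translate(lists):
    return [[y for y in (x.translate(table) for x in sub) if y] for sub in lists]

def with_rstrip(lists):
    return [[y for y in (x.rstrip(punctuation) for x in sub) if y] for sub in lists]

# Time each approach over the same input and report the total seconds.
for func in (with_regex, with_translate, with_rstrip):
    seconds = timeit.timeit(lambda: func(data), number=1000)
    print(func.__name__, seconds)
```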
Answer 3: (score: 0)
new_lst = []
for inner in lst:
    new_inner = []
    for word in inner:
        # Keep only alphabetic characters (this also drops digits).
        filtered = ''.join(c for c in word if c.isalpha())
        if filtered:
            new_inner.append(filtered)
    new_lst.append(new_inner)
print(new_lst)
Answer 4: (score: -1)
import string
data = [['comfortable',
'questions?',
'menu',
'items!',
'time',
'lived',
'there,',
'could',
'easily',
'direct',
'people',
'appropriate',
'menu',
'choices',
'given',
'allergies.'],
['.',
'sure',
'giving',
'wheat',
'fiction',
'free',
'foodthis',
'place',
'clean.']]
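result = []
for d in data:
    # Remove every punctuation character from every word, one character at a time.
    for r in string.punctuation:
        d = [x.replace(r, '') for x in d]
    # Keep only non-empty words; the test must be on the word x, not the list d.
    result.append([x for x in d if x])
print(result)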