Question

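My data is in a list of lists. I have tokenized the data, and the tokens contain non-letter characters (e.g. ?, ., !).

I want to remove the non-letter characters (e.g. ?, ., !) from the list below.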

[['comfortable',
  'questions?',
  'menu',
  'items!',
  'time',
  'lived',
  'there,',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies.'],
 ['.',
  'sure',
  'giving',
  'wheat',
  'fiction',
  'free',
  'foodthis',
  'place',
  'clean.']]

The output should look like this:

[['comfortable',
  'questions',
  'menu',
  'items',
  'time',
  'lived',
  'there',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies'],
 ['sure',
  'giving',
  'wheat',
  'fiction',
  'free',
  'foodthis',
  'place',
  'clean']]

I tried the code below (it does not work):

import re 
tokens = [re.sub(r'[^A-Za-z0-9]+', '', x) for x in texts]

Any suggestions?

Answer 1

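Your regex approach does not work because you have a list of lists, so the comprehension passes each inner list, rather than a string, to re.sub.

You need to iterate over the inner lists as well and apply re.sub to each string. Example: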

>>> import re
>>> import pprint
>>> tokens = [[y for y in (re.sub(r'[^A-Za-z0-9]+', '', x) for x in sublst) if y] for sublst in texts]
>>> pprint.pprint(tokens)
[['comfortable',
  'questions',
  'menu',
  'items',
  'time',
  'lived',
  'there',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies'],
 ['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]
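
To make the failure concrete, here is a minimal sketch, assuming Python 3 and a shortened, hypothetical version of the question's `texts`. It shows the original comprehension raising a TypeError because each `x` is a list, and then the nested version that hands re.sub one string at a time.

```python
import re

# Hypothetical sample, shortened from the question's data.
texts = [['questions?', 'menu'], ['.', 'clean.']]
pattern = re.compile(r'[^A-Za-z0-9]+')

# Original attempt: each x is an inner list, not a string, so the regex rejects it.
error = None
try:
    [pattern.sub('', x) for x in texts]
except TypeError as exc:
    error = exc
    print('TypeError:', exc)

# Iterating one level deeper passes a string each time;
# the trailing "if y" drops tokens that became empty (such as '.').
tokens = [[y for y in (pattern.sub('', x) for x in sub) if y] for sub in texts]
print(tokens)
```

Compiling the pattern once is optional here; it only avoids repeating the pattern string.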

Answer 2

You are almost there. Your tokens are a list of lists, but your list comprehension only iterates over the elements of the outer list.

from pprint import pprint

import re

tokens = [['comfortable',
           'questions?',
           'menu',
           'items!',
           'time',
           'lived',
           'there,',
           'could',
           'easily',
           'direct',
           'people',
           'appropriate',
           'menu',
           'choices',
           'given',
           'allergies.'],
          ['.',
           'sure',
           'giving',
           'wheat',
           'fiction',
           'free',
           'foodthis',
           'place',
           'clean.']]

out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in
       tokens]

pprint(out)

which produces:

[['comfortable',
  'questions',
  'menu',
  'items',
  'time',
  'lived',
  'there',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies'],
 ['sure',
  'giving',
  'wheat',
  'fiction',
  'free',
  'foodthis',
  'place',
  'clean']]

Answer 3

If the punctuation is always at the end of the token, you can str.rstrip it:

from string import punctuation

for sub in l:
    sub[:] = (word for word in (w.rstrip(punctuation) for w in sub)
             if word)

Output:

from pprint import pprint as pp
pp(l)


 [['comfortable',
  'questions',
  'menu',
  'items',
  'time',
  'lived',
  'there',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies'],
 ['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]

Or use str.translate to remove punctuation from anywhere in the token:

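from string import punctuation

for sub in l:
    # Python 2 only: translate(None, chars) deletes chars.
    # Python 3 needs a table built with str.maketrans instead.
    sub[:] = (word for word in (w.translate(None, punctuation) for w in sub)
              if word)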

Output:

[['comfortable',
  'questions',
  'menu',
  'items',
  'time',
  'lived',
  'there',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies'],
 ['sure', 'giving', 'wheat', 'fiction', 'free', 'foodthis', 'place', 'clean']]

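If you want a new list instead of modifying the sublists in place (note that this comprehension collects every word into a single flat list):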

cleaned = [word for sub in l
           for word in (w.translate(None, punctuation)
                        for w in sub) if word]
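
The `translate(None, punctuation)` call is the Python 2 signature; on Python 3, str.translate takes a single table argument. A minimal sketch of an equivalent, assuming Python 3 and a shortened, hypothetical version of the list `l`, which also keeps the nested list-of-lists shape the question asks for:

```python
from string import punctuation

# Hypothetical sample, shortened from the question's data.
l = [['questions?', 'items!', 'there,'], ['.', 'wheat', 'clean.']]

# Python 3: build a table that maps every punctuation character to None.
table = str.maketrans('', '', punctuation)

# One cleaned sublist per input sublist; tokens that end up empty are dropped.
cleaned = [[word for word in (w.translate(table) for w in sub) if word]
           for sub in l]
print(cleaned)
```

Building the table once outside the comprehension avoids recreating it for every token.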

str.translate is more efficient than the regex, and if the punctuation is always at the end, rstrip is more efficient still:

In [2]: %%timeit
   ....: r = re.compile(r'[^A-Za-z0-9]+')
   ....: [[y for y in (r.sub('', x) for x in sublst) if y] for sublst in l]
   ....: 
10000 loops, best of 3: 37.3 µs per loop

In [3]: %%timeit
   ....: out = [list(filter(None, [re.sub(r'[^A-Za-z0-9]+', '', x) for x in y])) for y in
   ....:        l]
   ....: 
10000 loops, best of 3: 58.3 µs per loop

In [4]: from string import punctuation

In [5]: %%timeit
   ...: cleaned = [word for sub in l
   ...:            for word in (w.translate(None, punctuation)
   ...:                         for w in sub) if word]
   ...: 

100000 loops, best of 3: 11.6 µs per loop

In [6]: %%timeit
  ...: cleaned = [word for sub in l
   ...:            for word in (w.rstrip(punctuation)
   ...:                         for w in sub) if word]
   ...: 

100000 loops, best of 3: 6.81 µs per loop

In [7]: %%timeit
   ....: result = []
   ....: for d in l:
   ....:     for r in string.punctuation:
   ....:         d = [x.replace(r, '') for x in d]
   ....:     result.append([x for x in d if d])
   ....: 
10000 loops, best of 3: 160 µs per loop
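
The session above uses the Python 2 form of str.translate. To repeat the comparison on Python 3, something like the following sketch could be used. It assumes the standard timeit module and a shortened, hypothetical sample; the function names are illustrative, and no timings are shown because they depend on the machine and interpreter.

```python
import re
import timeit
from string import punctuation

# Hypothetical sample, shortened from the question's data.
data = [['comfortable', 'questions?', 'items!', 'there,'],
        ['.', 'sure', 'wheat', 'clean.']]

NON_ALNUM = re.compile(r'[^A-Za-z0-9]+')
TABLE = str.maketrans('', '', punctuation)

def with_regex(rows):
    # delete every run of non-alphanumeric characters
    return [[y for y in (NON_ALNUM.sub('', x) for x in sub) if y] for sub in rows]

def with_translate(rows):
    # delete punctuation anywhere in the token
    return [[y for y in (x.translate(TABLE) for x in sub) if y] for sub in rows]

def with_rstrip(rows):
    # delete trailing punctuation only
    return [[y for y in (x.rstrip(punctuation) for x in sub) if y] for sub in rows]

# The three agree on this sample because its punctuation is only trailing.
assert with_regex(data) == with_translate(data) == with_rstrip(data)

for func in (with_regex, with_translate, with_rstrip):
    seconds = timeit.timeit(lambda: func(data), number=10000)
    print('{:<15} {:.4f} s per 10000 calls'.format(func.__name__, seconds))
```

The three functions are not interchangeable in general: rstrip leaves punctuation inside a token untouched, while the regex and translate versions remove it everywhere.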

Answer 4

new_lst = []
for inner in lst:
    new_inner = []
    for word in inner:
        # keep only the alphabetic characters of each token
        filtered = ''.join(c for c in word if c.isalpha())
        if filtered:  # skip tokens that were pure punctuation
            new_inner.append(filtered)
    new_lst.append(new_inner)
print(new_lst)

Answer 5

import string

data = [['comfortable',
  'questions?',
  'menu',
  'items!',
  'time',
  'lived',
  'there,',
  'could',
  'easily',
  'direct',
  'people',
  'appropriate',
  'menu',
  'choices',
  'given',
  'allergies.'],
 ['.',
  'sure',
  'giving',
  'wheat',
  'fiction',
  'free',
  'foodthis',
  'place',
  'clean.']]

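result = []
for d in data:
    # remove each punctuation character from every token in this sublist
    for r in string.punctuation:
        d = [x.replace(r, '') for x in d]
    # drop tokens that became empty
    result.append([x for x in d if x])
print(result)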

Removing non-letters from tokens in a Python list
