Question

I am trying to remove the entries that consist of non-alphabetic characters from a list of strings, for example:

["The", "sailor", "is", "sick", "."] -> ["The", "sailor", "is", "sick"]

But I can't simply remove every word that contains a non-alpha character, because cases like this are possible:

["The", "U.S.", "is", "big", "."] -> ["The", "U.S.", "is", "big"] (acronym kept but period is removed)

I need to come up with a regular expression or a similar approach that handles simple cases like this one (all kinds of punctuation):

["And", ",", "there", "she", "is", "."] -> ["And", "there", "she", "is"]

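I use a natural-language wrapper class to convert sentences into the lists on the left, but sometimes the lists are much more complicated: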

string:   "round up the "blonde bombshells' a all (well almost all)"
list: ["round", "up", "the", "''", "blonde", "bombshell", "\\", 
          "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]

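As you can see, the wrapper converts or removes some characters, such as parentheses and apostrophes. I want to strip out all of these extraneous substrings to get something cleaner: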

list: ["round", "up", "the", "blonde", "bombshell",
          "a", "all", "well", "almost", "all"]

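I'm fairly new to this. I think a regular expression is my best approach, but I don't know how to turn the first list into the cleaned-up second one. Any help is appreciated!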

Answer 1

This seems to fit your description:

cases=[
    ["The", "sailor", "is", "sick", "."],
    ["The", "U.S.", "is", "big", "."],
    ["round", "up", "the", "''", "blonde", "bombshell", "\\", 
    "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"],
]

import re

for li in cases:
    # keep only the tokens that start with an ASCII letter
    kept = [w for w in li if re.search(r'^[a-zA-Z]', w)]
    print('{}\n\t->{}'.format(li, kept))

Prints:

['The', 'sailor', 'is', 'sick', '.']
    ->['The', 'sailor', 'is', 'sick']
['The', 'U.S.', 'is', 'big', '.']
    ->['The', 'U.S.', 'is', 'big']
['round', 'up', 'the', "''", 'blonde', 'bombshell', '\\', 'a', 'all', '-lrb-', 'well', 'almost', 'all', '-rrb-']
    ->['round', 'up', 'the', 'blonde', 'bombshell', 'a', 'all', 'well', 'almost', 'all']

If that is what you want, you can do without a regular expression entirely:

for li in cases:
    # w[:1] instead of w[0] so an empty string is dropped rather than raising IndexError
    print('{}\n\t->{}'.format(li, [w for w in li if w[:1].isalpha()]))
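One difference between the two variants above is worth keeping in mind: `str.isalpha()` accepts any Unicode letter, while `[a-zA-Z]` accepts only ASCII letters. A minimal sketch of that difference follows; the helper names and the sample list are hypothetical and not taken from the question.

```python
import re

ASCII_INITIAL = re.compile(r'[a-zA-Z]')  # match() anchors at the start of the token

def keep_ascii_initial(tokens):
    """Keep tokens whose first character is an ASCII letter (hypothetical helper)."""
    return [t for t in tokens if ASCII_INITIAL.match(t)]

def keep_alpha_initial(tokens):
    """Keep tokens whose first character is any Unicode letter (hypothetical helper)."""
    return [t for t in tokens if t[:1].isalpha()]

sample = ["The", "café", "école", "is", "open", "."]  # illustrative data only
print(keep_ascii_initial(sample))  # drops "école": "é" is not in [a-zA-Z]
print(keep_alpha_initial(sample))  # keeps "école": "é".isalpha() is True
```

If the input is plain English text the two behave the same, so the choice only matters once accented or non-Latin words can appear.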

Answer 2

You can do this with string.punctuation:

>>> from string import punctuation
>>> lst = ["The", "U.S.", "is", "big", "."]
>>> [i for i in lst if i not in punctuation]
['The', 'U.S.', 'is', 'big']
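Note that `i not in punctuation` is a substring test against the `punctuation` string, so it only drops tokens that happen to appear inside that string. Multi-character tokens such as `"''"` or `"-lrb-"` from the question would be kept. A possible variation, shown here only as a sketch, checks every character of the token instead:

```python
from string import punctuation

tokens = ["round", "up", "the", "''", "blonde", "bombshell", "\\",
          "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]

# Substring test: "''" and "-lrb-" are not substrings of punctuation, so they stay.
by_substring = [t for t in tokens if t not in punctuation]

# Character test: drop tokens made up entirely of punctuation characters.
by_chars = [t for t in tokens if not all(ch in punctuation for ch in t)]

print(by_substring)
print(by_chars)
```

Even the character test keeps `"-lrb-"` and `"-rrb-"`, because they contain letters; those tokens need one of the start-of-string checks from the other answers.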

Answer 3

Make sure each string contains at least one alphanumeric character:

import re

expr = re.compile(r"\w+")
test = ["And", ",", "there", "she", "is", ".", "U.S."]

filtered = [v for v in test if expr.search(v)]
print(filtered)

Prints:

['And', 'there', 'she', 'is', 'U.S.']
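Keep in mind that `\w` matches digits and the underscore as well as letters, so purely numeric tokens pass this filter. The sketch below compares `\w` with a letters-only class on a made-up token list (the list and variable names are illustrative assumptions):

```python
import re

word_char = re.compile(r"\w")      # letters, digits and underscore
letter = re.compile(r"[a-zA-Z]")   # ASCII letters only

tokens = ["U.S.", "42", "_", "...", "3rd"]  # illustrative data only

with_word_char = [t for t in tokens if word_char.search(t)]
with_letter = [t for t in tokens if letter.search(t)]

print(with_word_char)  # "42" and "_" are kept
print(with_letter)     # only tokens containing a letter are kept
```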

An alternative is to exclude digits and to make sure the string does not start with a non-alphabetic character:

# only alpha
expr = re.compile(r"[a-zA-Z]+")
test = ["round", "up", "the", "''", "blonde", "bombshell", "\\",
        "a", "all", "-lrb-", "well", "almost", "all", "-rrb-"]
# use match() here
filtered = [v for v in test if expr.match(v)]
print(filtered)

Prints:

['round', 'up', 'the', 'blonde', 'bombshell', 'a', 'all', 'well', 'almost', 'all']
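The choice of `match()` over `search()` matters here: `search()` looks for letters anywhere in the token, while `match()` only succeeds if they appear at the very start. A small sketch on a subset of the tokens from the question illustrates this (variable names are illustrative):

```python
import re

expr = re.compile(r"[a-zA-Z]+")
tokens = ["all", "-lrb-", "well", "''", "-rrb-"]

# search() finds the letters inside "-lrb-" and "-rrb-", so they are kept.
found_anywhere = [t for t in tokens if expr.search(t)]

# match() is anchored at the start, so tokens beginning with "-" are dropped.
found_at_start = [t for t in tokens if expr.match(t)]

print(found_anywhere)
print(found_at_start)
```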

Regex to remove non-letter (A-Z, a-z) entries from a list (with exceptions)
