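How to extract all list elements that start with the same set of characters in Python

Time: 2019-03-06 18:50:54

Tags: python nltk pos-tagger

I have to split a list of POS-tagged words into sublists according to the POS tag used. My list looks like this: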

List=[", -> ','", ". -> '!'", ". -> '.'", ". -> '?'", "CC -> 'but'", "CD -> 'hundred'",
      "CD -> 'one'", "DT -> 'the'", "EX -> 'There'","IN -> 'as'", "IN -> 'because'",
      "IN -> 'if'", "IN -> 'in'", "JJ -> 'Sure'", 'MD -> "\'ll"', "MD -> 'ca'",
      "MD -> 'can'", "MD -> 'will'", "MD -> 'would'", "NN -> 'Applause'",
      "NN -> 'anybody'", "NN -> 'doubt'", "NNP -> 'Syria'",
      "NNS -> 'Generals'", "NNS -> 'people'", "NNS -> 'states'",  "PRP -> 'it'",
      "PRP$ -> 'our'",  "RB -> 'there'", "RBR -> 'more'", "RP -> 'out'", "TO -> 'to'",
      "UH -> 'Oh'", "UH -> 'Wow'", "VB -> 'stop'", "VB -> 'want'", "VBD -> 'knew'",
      "VBD -> 'was'", "VBG -> 'allowing'", "VBG -> 'doing'", "VBG -> 'going'",
      "VBN -> 'called'", "VBP -> 'take'", 'VBZ -> "\'s"', "VBZ -> 'is'", 
      "WDT -> 'that'", "WP -> 'what'"]

My desired output would be

[["IN -> 'as'", "IN -> 'because'", "IN -> 'if'", "IN -> 'in'"],["UH -> 'Oh'", "UH -> 'Wow'"]]

or, even better,

CC = ['but']
CD = ['hundred', 'one']

I have searched a lot, but the only thing I could find that works at least partially is:

from itertools import groupby
print([list(g) for k, g in groupby(List, key=lambda x: x[0])])

I have been playing around with the value of x, but nothing seems to work well.
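A note on why this attempt only partly works: the key `x[0]` looks at the first character only, so `NN`, `NNP` and `NNS` all end up in one group. A minimal sketch of a key that uses the whole tag instead, assuming every item has the form `TAG -> 'word'` (the names `tagged` and `tag_of` are made up for this illustration, and the list is a shortened sample):

```python
from itertools import groupby

# Shortened sample of the list from the question.
tagged = ["CC -> 'but'", "CD -> 'hundred'", "CD -> 'one'",
          "IN -> 'as'", "IN -> 'because'", "IN -> 'if'", "IN -> 'in'",
          "UH -> 'Oh'", "UH -> 'Wow'"]

def tag_of(item):
    # Everything before the first " -> " is taken to be the POS tag.
    return item.split(" -> ", 1)[0]

# groupby only merges adjacent items, so sort by the same key first.
groups = [list(g) for _, g in groupby(sorted(tagged, key=tag_of), key=tag_of)]
print(groups)
```

Sorting first matters because `groupby` starts a new group every time the key changes, even if the same key appeared earlier in the list.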

I would also be happy to use something like this:

import re

RB = []
for item in List:
    if item.startswith('RB'):
        g=re.findall('-> (.*)', item)
        RB.append(g)

This should certainly work, but doing it for about 40 different POS tags would be painful. There must be an easier way.

1 answer:

Answer 0: (score: 0)

Use a defaultdict:

from collections import defaultdict

List = [", -> ','", ". -> '!'", ". -> '.'", ". -> '?'", "CC -> 'but'", "CD -> 'hundred'",
      "CD -> 'one'", "DT -> 'the'", "EX -> 'There'","IN -> 'as'", "IN -> 'because'",
      "IN -> 'if'", "IN -> 'in'", "JJ -> 'Sure'", 'MD -> "\'ll"', "MD -> 'ca'",
      "MD -> 'can'", "MD -> 'will'", "MD -> 'would'", "NN -> 'Applause'",
      "NN -> 'anybody'", "NN -> 'doubt'", "NNP -> 'Syria'",
      "NNS -> 'Generals'", "NNS -> 'people'", "NNS -> 'states'",  "PRP -> 'it'",
      "PRP$ -> 'our'",  "RB -> 'there'", "RBR -> 'more'", "RP -> 'out'", "TO -> 'to'",
      "UH -> 'Oh'", "UH -> 'Wow'", "VB -> 'stop'", "VB -> 'want'", "VBD -> 'knew'",
      "VBD -> 'was'", "VBG -> 'allowing'", "VBG -> 'doing'", "VBG -> 'going'",
      "VBN -> 'called'", "VBP -> 'take'", 'VBZ -> "\'s"', "VBZ -> 'is'", 
      "WDT -> 'that'", "WP -> 'what'"]

data = defaultdict(set)
for key, value in (item.split('->') for item in List):
    # strip whitespace, then remove the quote characters around the word
    data[key.strip()].add(value.strip().replace("'", '').replace('"', ''))
print(dict(data))

The result is:

{',': {','}, '.': {'.', '!', '?'}, 'CC': {'but'}, 'CD': {'hundred','one'}, 'DT': {'the'}, 'EX': {'There'}, 'IN': {'in', 'because', 'as', 'if'}, 'JJ': {'Sure'}, 'MD': {'will', 'ca', 'll', 'can', 'would'}, 'NN': {'Applause', 'doubt', 'anybody'}, 'NNP': {'Syria'}, 'NNS': {'Generals', 'states', 'people'}, 'PRP': {'it'}, 'PRP$': {'our'}, 'RB': {'there'}, 'RBR': {'more'}, 'RP': {'out'}, 'TO': {'to'}, 'UH': {'Wow', 'Oh'}, 'VB': {'want', 'stop'}, 'VBD': {'was', 'knew'}, 'VBG': {'going', 'allowing', 'doing'}, 'VBN': {'called'}, 'VBP': {'take'}, 'VBZ': {'s', 'is'}, 'WDT': {'that'}, 'WP': {'what'}}
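Two side effects of the set-based version are worth knowing: sets do not keep the original order, and removing every quote character turns `'ll` into `ll` and `'s` into `s`. If that matters, a variant along the following lines might be closer to the `CD = ['hundred', 'one']` output asked for. This is only a sketch, assuming each item is exactly `TAG -> ` followed by a word wrapped in one pair of quotes; `sample` and `by_tag` are names made up for this illustration.

```python
from collections import defaultdict

# Shortened sample of the list from the question.
sample = ["CD -> 'hundred'", "CD -> 'one'", 'MD -> "\'ll"', "MD -> 'can'",
          'VBZ -> "\'s"', "VBZ -> 'is'"]

by_tag = defaultdict(list)  # lists keep the order in which words appear
for item in sample:
    # partition splits at the first " -> " only
    tag, _, quoted = item.partition(" -> ")
    # drop just the outer pair of quotes, keeping apostrophes inside the word
    by_tag[tag].append(quoted[1:-1])

print(dict(by_tag))
```

Individual tags can then be looked up as `by_tag['CD']` rather than creating some 40 separate variables.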