I have to split a list of POS-tagged words into sublists according to the POS tag used. My list looks like this:
List=[", -> ','", ". -> '!'", ". -> '.'", ". -> '?'", "CC -> 'but'", "CD -> 'hundred'",
"CD -> 'one'", "DT -> 'the'", "EX -> 'There'","IN -> 'as'", "IN -> 'because'",
"IN -> 'if'", "IN -> 'in'", "JJ -> 'Sure'", 'MD -> "\'ll"', "MD -> 'ca'",
"MD -> 'can'", "MD -> 'will'", "MD -> 'would'", "NN -> 'Applause'",
"NN -> 'anybody'", "NN -> 'doubt'", "NNP -> 'Syria'",
"NNS -> 'Generals'", "NNS -> 'people'", "NNS -> 'states'", "PRP -> 'it'",
"PRP$ -> 'our'", "RB -> 'there'", "RBR -> 'more'", "RP -> 'out'", "TO -> 'to'",
"UH -> 'Oh'", "UH -> 'Wow'", "VB -> 'stop'", "VB -> 'want'", "VBD -> 'knew'",
"VBD -> 'was'", "VBG -> 'allowing'", "VBG -> 'doing'", "VBG -> 'going'",
"VBN -> 'called'", "VBP -> 'take'", 'VBZ -> "\'s"', "VBZ -> 'is'",
"WDT -> 'that'", "WP -> 'what'"]
The output I want would be
[["IN -> 'as'", "IN -> 'because'", "IN -> 'if'", "IN -> 'in'"],["UH -> 'Oh'", "UH -> 'Wow'"]]
or, even better,
CC = ['but']
CD = ['hundred', 'one']
I searched a lot, but the only thing I could find that works at least partially is:
from itertools import groupby
print([list(g) for k, g in groupby(List, key=lambda x: x[0])])
I have been playing around with the key on x, but nothing seems to work well.
I was also tempted to use something like this:
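import re

RB = []
for item in List:
    # note: startswith('RB') also matches the 'RBR' entries
    if item.startswith('RB'):
        # findall returns a list, so RB ends up as a list of lists
        g = re.findall('-> (.*)', item)
        RB.append(g)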
This would certainly work, but doing it for about 40 different POS tags would be a pain. There must be an easier way.
Answer 0 (score: 0)
Use defaultdict:
from collections import defaultdict
List = [", -> ','", ". -> '!'", ". -> '.'", ". -> '?'", "CC -> 'but'", "CD -> 'hundred'",
"CD -> 'one'", "DT -> 'the'", "EX -> 'There'","IN -> 'as'", "IN -> 'because'",
"IN -> 'if'", "IN -> 'in'", "JJ -> 'Sure'", 'MD -> "\'ll"', "MD -> 'ca'",
"MD -> 'can'", "MD -> 'will'", "MD -> 'would'", "NN -> 'Applause'",
"NN -> 'anybody'", "NN -> 'doubt'", "NNP -> 'Syria'",
"NNS -> 'Generals'", "NNS -> 'people'", "NNS -> 'states'", "PRP -> 'it'",
"PRP$ -> 'our'", "RB -> 'there'", "RBR -> 'more'", "RP -> 'out'", "TO -> 'to'",
"UH -> 'Oh'", "UH -> 'Wow'", "VB -> 'stop'", "VB -> 'want'", "VBD -> 'knew'",
"VBD -> 'was'", "VBG -> 'allowing'", "VBG -> 'doing'", "VBG -> 'going'",
"VBN -> 'called'", "VBP -> 'take'", 'VBZ -> "\'s"', "VBZ -> 'is'",
"WDT -> 'that'", "WP -> 'what'"]
data = defaultdict(set)
for key, value in (item.split('->') for item in List):
    # strip whitespace and remove every quote character (so "'ll" becomes ll)
    data[key.strip()].add(value.strip().replace("'", '').replace('"', ''))
print(dict(data))
The result is:
{',': {','}, '.': {'.', '!', '?'}, 'CC': {'but'}, 'CD': {'hundred','one'}, 'DT': {'the'}, 'EX': {'There'}, 'IN': {'in', 'because', 'as', 'if'}, 'JJ': {'Sure'}, 'MD': {'will', 'ca', 'll', 'can', 'would'}, 'NN': {'Applause', 'doubt', 'anybody'}, 'NNP': {'Syria'}, 'NNS': {'Generals', 'states', 'people'}, 'PRP': {'it'}, 'PRP$': {'our'}, 'RB': {'there'}, 'RBR': {'more'}, 'RP': {'out'}, 'TO': {'to'}, 'UH': {'Wow', 'Oh'}, 'VB': {'want', 'stop'}, 'VBD': {'was', 'knew'}, 'VBG': {'going', 'allowing', 'doing'}, 'VBN': {'called'}, 'VBP': {'take'}, 'VBZ': {'s', 'is'}, 'WDT': {'that'}, 'WP': {'what'}}
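If the lists asked for in the question are preferred over sets, or the original word order should be kept, something along the following lines might work. This is only a sketch: `tagged` is a shortened sample of the question's list, and `split_entry`, `by_tag` and `sublists` are names made up for illustration. It assumes every entry has the form TAG -> 'word' with exactly one pair of outer quotes. The second variant shows a `groupby` key that uses the whole tag rather than `x[0]`, and it relies on entries with the same tag being adjacent, as they are in the question's sorted list.

```python
from collections import defaultdict
from itertools import groupby

# Shortened sample of the question's list (same "TAG -> 'word'" format).
tagged = ["CC -> 'but'", "CD -> 'hundred'", "CD -> 'one'",
          "IN -> 'as'", "IN -> 'because'", 'MD -> "\'ll"', "MD -> 'can'",
          "UH -> 'Oh'", "UH -> 'Wow'"]


def split_entry(entry):
    """Split "TAG -> 'word'" into (tag, word), dropping only the outer quotes."""
    tag, _, quoted = entry.partition(" -> ")
    return tag, quoted[1:-1]


# Variant 1: dict of lists, which keeps the original order of the words.
by_tag = defaultdict(list)
for entry in tagged:
    tag, word = split_entry(entry)
    by_tag[tag].append(word)

# Variant 2: groupby keyed on the full tag instead of the first character.
# groupby only merges neighbouring items, so equal tags must be adjacent.
sublists = [list(group)
            for _, group in groupby(tagged, key=lambda e: e.partition(" -> ")[0])]

print(dict(by_tag))
print(sublists)
```

Because only the outer quotes are sliced off, a contraction such as 'll keeps its apostrophe here, whereas the `replace` calls above remove it.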