Append list items by number of hyphens available

Time: 2014-10-08 01:59:13

Tags: python

mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'), ('both','RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'), ('school', 'NN-CC-JJ-DT'), 
    ('teacher', 'NN-VBZ-PPL-JJ-DT'), ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]

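I have a list named mylist. It consists of tuples, each pairing a word with a string of hyphen-separated tags. I don't want to use regex. The minimum number of tags is 1 and the maximum is 5. I want 5 different lists, one for each number of tags.

For the tuples with a single tag, I tried this: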

one = []
for i in mylist:
    if '-' not in i[1]:
        one.append(i)
print(one)

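This correctly prints [('country', 'NN'), ('receive', 'VBZ')]

For two tags, I want it to print [('threats', 'NN-JJ'), ('former', 'NN-RB')]

And so on for the sets of three, four, and five tags. I can't figure out how to do this.

My actual file has n tags, and it contains about 10 million words with their tags. Is there a way to find out which word has the largest number of different tags?

That would be very helpful!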

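6 answers:

Answer 0: (score: 3)

You can use a defaultdict to organize the data, and .count to count the number of - characters. A tag string with n hyphens holds n + 1 tags.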

from collections import defaultdict

mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ... ]
res = defaultdict(list)

for item, tags in mylist:
    res[tags.count('-') + 1].append((item, tags))

You can print the result with the following code.

for k, v in res.items():
    print(str(k) + ": " + str(v))

This prints:

brunsgaard@archbook /tmp> python test2.py
1: [('country', 'NN'), ('receive', 'VBZ')]
2: [('threats', 'NN-JJ'), ('former', 'NN-RB')]
3: [('shoot', 'NN-DT-PPL'), ('both', 'RB-JJ-NN')]
4: [('during', 'NN-VBD-JJ-RB'), ('school', 'NN-CC-JJ-DT')]
5: [('teacher', 'NN-VBZ-PPL-JJ-DT'), ('batman', 'NN-IN-ABX-CD-RB')]
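
Since the question asks for five separate lists, the grouped dictionary can be unpacked into individual names. This is only a minimal sketch: it assumes the tag count never exceeds 5, as stated in the question, and the names one through five are illustrative rather than part of the original answer.

```python
from collections import defaultdict

mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'),
          ('both', 'RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'),
          ('school', 'NN-CC-JJ-DT'), ('teacher', 'NN-VBZ-PPL-JJ-DT'),
          ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]

res = defaultdict(list)
for item, tags in mylist:
    # n hyphens separate n + 1 tags
    res[tags.count('-') + 1].append((item, tags))

# .get returns an empty list for a missing count without adding a key
# (assumption: tag counts run from 1 to 5, as in the question)
one, two, three, four, five = (res.get(n, []) for n in range(1, 6))

print(two)  # [('threats', 'NN-JJ'), ('former', 'NN-RB')]
```

Keeping the dictionary is usually more convenient than five variables when the real data has an unknown number n of tags.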

Answer 1: (score: 1)

Another way, using itertools.groupby:

from itertools import groupby
from operator import itemgetter


a=[('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'), ('both','RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'), ('school', 'NN-CC-JJ-DT'),
    ('teacher', 'NN-VBZ-PPL-JJ-DT'), ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]


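# Key function: number of tags in the hyphen-separated tag string
func = lambda x: len(x[1].split('-'))
# groupby only merges adjacent items, so sort by the same key first
for k, g in groupby(sorted(a, key=func), key=func):
    print(k, list(g))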

# Output
1 [('country', 'NN'), ('receive', 'VBZ')]
2 [('threats', 'NN-JJ'), ('former', 'NN-RB')]
3 [('shoot', 'NN-DT-PPL'), ('both', 'RB-JJ-NN')]
4 [('during', 'NN-VBD-JJ-RB'), ('school', 'NN-CC-JJ-DT')]
5 [('teacher', 'NN-VBZ-PPL-JJ-DT'), ('batman', 'NN-IN-ABX-CD-RB')]
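
The sort in this answer matters because groupby only groups consecutive items with equal keys. The sketch below, which uses a hypothetical helper named tag_count in place of the lambda, collects the groups into a dictionary and also shows the runs groupby would see if the sort were skipped.

```python
from itertools import groupby

a = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'),
     ('both', 'RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'),
     ('school', 'NN-CC-JJ-DT'), ('teacher', 'NN-VBZ-PPL-JJ-DT'),
     ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]


def tag_count(pair):
    """Number of tags in the hyphen-separated tag string of a (word, tags) pair."""
    return len(pair[1].split('-'))


# Sorted input: one group per tag count; sorted() is stable, so the
# original order is preserved inside each group.
grouped = {k: list(g) for k, g in groupby(sorted(a, key=tag_count), key=tag_count)}

# Unsorted input: every change of key starts a new group.
unsorted_runs = [k for k, _ in groupby(a, key=tag_count)]

print(grouped[3])     # [('shoot', 'NN-DT-PPL'), ('both', 'RB-JJ-NN')]
print(unsorted_runs)  # [1, 3, 2, 3, 4, 2, 4, 5, 1, 5]
```

Sorting costs O(n log n), so for very large inputs the single-pass dictionary approach from the other answers may be preferable.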

Answer 2: (score: 1)

from collections import defaultdict

mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ... ]
res = defaultdict(list)

for item, tags in mylist:
    res[tags.count('-') + 1].append((item, tags))

Answer 3: (score: 0)

You can split the string using '-' as the delimiter and count the number of elements in the resulting list, like this (for 3 tags):

>>> [t for t in mylist if len(t[1].split('-')) == 3]
[('shoot', 'NN-DT-PPL'), ('both', 'RB-JJ-NN')]
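
The same comprehension can be wrapped in a small function so that it works for any tag count. The function name with_n_tags is hypothetical and not part of the original answer; this is just one way to generalize it.

```python
mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'),
          ('both', 'RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'),
          ('school', 'NN-CC-JJ-DT'), ('teacher', 'NN-VBZ-PPL-JJ-DT'),
          ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]


def with_n_tags(pairs, n):
    """Return the (word, tags) pairs whose tag string holds exactly n tags."""
    return [t for t in pairs if len(t[1].split('-')) == n]


# One call per tag count; each call scans the whole list once.
for n in range(1, 6):
    print(n, with_n_tags(mylist, n))
```

Because every call traverses the full list, building all five lists this way reads the data five times. That is fine for a small list, but a single-pass dictionary is likely the better fit for millions of words.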

Answer 4: (score: 0)

The maximum tag count (the dash count plus one) is:

max_dash_count = max(i[1].count('-') for i in mylist) + 1

However, there is a more efficient way to group the items, using a dictionary:

dash_dict = dict()
for i in mylist:
    # number of tags = number of dashes + 1
    count = i[1].count('-') + 1
    if count in dash_dict:
        dash_dict[count].append(i)  # lists use append, not add
    else:
        dash_dict[count] = [i]

Afterwards, you have a dictionary of lists that you can easily iterate over:

for count in sorted(dash_dict.keys()):
    print('Items with ' + str(count) + ' tags:')
    for i in dash_dict[count]:
        print(repr(i))
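
The question also asks which word has the largest number of different tags. A possible sketch follows, assuming "different" means distinct tags within one tag string, so a set is used to drop repeats. The helper name distinct_tag_count is illustrative only.

```python
mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'),
          ('both', 'RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'),
          ('school', 'NN-CC-JJ-DT'), ('teacher', 'NN-VBZ-PPL-JJ-DT'),
          ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]


def distinct_tag_count(pair):
    """Count distinct tags; a string like 'NN-NN-JJ' would count as 2."""
    return len(set(pair[1].split('-')))


# Largest number of distinct tags carried by any word
best = max(distinct_tag_count(p) for p in mylist)

# Several words can tie for the maximum, so collect all of them
winners = [p[0] for p in mylist if distinct_tag_count(p) == best]

print(best, winners)  # 5 ['teacher', 'batman']
```

This makes two passes over the data. If the same word can appear on several lines of the real file, the tags would first need to be merged per word, which this sketch does not attempt.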

Answer 5: (score: 0)

#!/usr/bin/python

mylist = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'), ('both','RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'), ('school', 'NN-CC-JJ-DT'), ('teacher', 'NN-VBZ-PPL-JJ-DT'), ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]
MAX_TAG = 5
def findTag():
   d = {}
   for tup in mylist:
      a,b = tup
      n = b.count('-')
      if not 0 <= n <= MAX_TAG - 1:
         continue
      if n not in d:
         d[n] = []
      d[n].append(tup)

   for k in sorted(d.keys()):
      print('{} => {}'.format(k+1, d[k]))
if __name__ == '__main__':
   findTag()

1 => [('country', 'NN'), ('receive', 'VBZ')]
2 => [('threats', 'NN-JJ'), ('former', 'NN-RB')]
3 => [('shoot', 'NN-DT-PPL'), ('both', 'RB-JJ-NN')]
4 => [('during', 'NN-VBD-JJ-RB'), ('school', 'NN-CC-JJ-DT')]
5 => [('teacher', 'NN-VBZ-PPL-JJ-DT'), ('batman', 'NN-IN-ABX-CD-RB')]
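
For an input as large as the roughly 10 million words mentioned in the question, it may be enough to count how many words fall into each tag count before deciding whether to store the groups at all. The sketch below uses collections.Counter over any iterable of (word, tags) pairs. The function name tag_count_histogram is hypothetical, and how the pairs are read from the real file is left open because the file format is not given.

```python
from collections import Counter


def tag_count_histogram(pairs):
    """Map each tag count to the number of words that have that many tags.

    `pairs` can be any iterable of (word, tags) tuples, including a
    generator, so the full data set never has to sit in memory at once.
    """
    return Counter(tags.count('-') + 1 for _, tags in pairs)


sample = [('country', 'NN'), ('shoot', 'NN-DT-PPL'), ('threats', 'NN-JJ'),
          ('both', 'RB-JJ-NN'), ('during', 'NN-VBD-JJ-RB'), ('former', 'NN-RB'),
          ('school', 'NN-CC-JJ-DT'), ('teacher', 'NN-VBZ-PPL-JJ-DT'),
          ('receive', 'VBZ'), ('batman', 'NN-IN-ABX-CD-RB')]

# iter() stands in for a lazily read source such as a file
hist = tag_count_histogram(iter(sample))
print(sorted(hist.items()))  # [(1, 2), (2, 2), (3, 2), (4, 2), (5, 2)]
```

Unlike the fixed MAX_TAG check above, this makes no assumption about the largest tag count, so max(hist) gives it directly.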