Python: removing duplicate names

Time: 2017-10-27 08:55:31

Tags: python python-2.7 python-3.x

I have a plain-text file in which every line holds an offset followed by a tagged word:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3497  England/LOCATION
  3521  >India<TOPONYM>/O
  3526  >Zimbabwe<TOPONYM>/O
  3531  >England<TOPONYM>/O
  3536  >Melbourne<TOPONYM>/O
  3541  >England<TOPONYM>/O
  3546  >England<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
  3556  >England<TOPONYM>/O
  3561  >England<TOPONYM>/O
  3566  >Australia<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3821  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4234  >Hampden<TOPONYM>/O
  4239  >Hampden<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4845  >Edinburgh<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

I want to remove repeated location names from this list, so that it looks like this:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3497  England/LOCATION
  3526  >Zimbabwe<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

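I want to remove the duplicate location names while keeping the DOCID lines in the file. I know this can be done on Linux with uniq, but if I run it, it will also remove locations that repeat across different DOCIDs. Is there a way to split the file by DOCID and remove a name only when it repeats within the same DOCID?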

2 answers:

Answer 0 (score: 3)

I'm writing this from my phone, so it is not a complete solution, just the key points:

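import re
import sys

# A DOCID line starts a new document; every other line is "<offset> <name>/<tag>".
docid_re = re.compile(r"^\s*\d+\s+<DOCID>")
location_re = re.compile(r"^\s*\d+\s+>?(.+?)(?:<TOPONYM>)?/")

seen = set()  # location names already printed for the current DOCID
with open("testfile") as infile:  # "testfile" is a placeholder for the input path
    for line in infile:
        if docid_re.match(line):
            seen = set()  # new document: forget the previous names
            sys.stdout.write(line)
            continue
        match = location_re.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        loc = match.group(1)
        if loc not in seen:
            sys.stdout.write(line)
            seen.add(loc)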

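Basically, it checks whether each line starts a new DOCID. If it does not, it reads the location name from the line and checks whether that name has already been seen. If not, it prints the line and adds the name to the record of seen locations.

When a new DOCID appears, it resets the record of locations read so far.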
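To make the reset behaviour easier to try out, here is a minimal sketch of the same idea wrapped in a generator. The function name `dedupe_per_docid` and the two regular expressions are my own assumptions based on the sample lines in the question, not something the answer specifies, and the sample list is a shortened excerpt of the question's data.

```python
import re

# Assumed line shapes, taken from the sample in the question:
#   "<offset>  <DOCID>...<DOCID>/O"      starts a new document
#   "<offset>  [>]Name[<TOPONYM>]/TAG"   names a location
DOCID_RE = re.compile(r"^\s*\d+\s+<DOCID>")
LOCATION_RE = re.compile(r"^\s*\d+\s+>?(.+?)(?:<TOPONYM>)?/")


def dedupe_per_docid(lines):
    """Yield lines, dropping names already seen in the current document."""
    seen = set()
    for line in lines:
        if DOCID_RE.match(line):
            seen = set()  # a new DOCID resets the memory of names
            yield line
            continue
        match = LOCATION_RE.match(line)
        if match is None:
            continue  # blank or unrecognised line: dropped in this sketch
        name = match.group(1)
        if name not in seen:
            seen.add(name)
            yield line


# Shortened excerpt of the question's data.
sample = [
    "3210    <DOCID>GH950102-000003<DOCID>/O",
    "  3360  England/LOCATION",
    "  3531  >England<TOPONYM>/O",
    "  3551  >Glasgow<TOPONYM>/O",
    "3568    <DOCID>GH950102-000004<DOCID>/O",
    "  4249  >Glasgow<TOPONYM>/O",
]
result = list(dedupe_per_docid(sample))
for line in result:
    print(line)
```

The second England line is dropped because it repeats within the first document, while Glasgow survives in both documents because the set is cleared at the second DOCID line.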

Answer 1 (score: 2)

Here is one way to do it.

import string

filename = 'testfile'
punctuation = set(string.punctuation)  # built once instead of once per line

final_list = []
with open(filename, 'r') as infile:  # the file is closed automatically
    unique_list = []  # names seen so far; reset at every DOCID line
    for line in infile:
        if 'DOCID' in line:
            unique_list = []
            final_list.append(line)
            continue
        # Replace punctuation with spaces so ">India<TOPONYM>/O" splits into words.
        cleaned = ''.join(ch if ch not in punctuation else ' ' for ch in line)
        fields = cleaned.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        city = fields[1]  # first word of the name only: "Fir Park" is keyed as "Fir"
        if city not in unique_list:
            unique_list.append(city)
            final_list.append(line)

for line in final_list:
    print(line.rstrip('\n'))  # each line already ends with a newline

Output:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3526  >Zimbabwe<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

Note: testfile is the text file containing the input text. You can optimize the code if necessary.
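As one possible optimization, the sketch below keeps the seen names in a set rather than a list and keys on the full location name instead of its first word, so two multi-word names that share a first word are not confused. The helpers `location_name` and `unique_locations` are hypothetical names of mine, and "Fir Hill" is a made-up entry added only to show the difference; everything assumes lines shaped like the sample in the question.

```python
def location_name(line):
    """Return the full location name from one line, or None if malformed."""
    parts = line.split(None, 1)        # drop the leading offset
    if len(parts) < 2:
        return None
    name = parts[1].rsplit("/", 1)[0]  # drop the trailing "/LOCATION" or "/O"
    if name.endswith("<TOPONYM>"):
        name = name[: -len("<TOPONYM>")]
    return name.lstrip(">").strip()


def unique_locations(lines):
    """Return the lines with duplicates removed within each DOCID."""
    kept = []
    seen = set()  # average O(1) membership test, versus O(n) for a list
    for line in lines:
        if "DOCID" in line:
            seen = set()  # a new document starts: reset the names
            kept.append(line)
            continue
        name = location_name(line)
        if name is None:
            continue  # skip blank or malformed lines
        if name not in seen:
            seen.add(name)
            kept.append(line)
    return kept


sample = [
    "3568    <DOCID>GH950102-000004<DOCID>/O",
    "  4161  Fir Park/LOCATION",
    "  4170  Fir Hill/LOCATION",  # made-up line, not in the original data
    "  4229  Park<TOPONYM>/O",
    "  4234  >Hampden<TOPONYM>/O",
    "  4239  >Hampden<TOPONYM>/O",
]
result = unique_locations(sample)
for line in result:
    print(line)
```

With first-word keys, "Fir Hill" would be treated as a repeat of "Fir Park". Here both are kept, and only the second Hampden line is removed.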