Remove duplicate word combinations from a text file using Python

Time: 2012-10-12 11:47:16

Tags: python

With eumiro's help in Delete duplicate rows in textfile - except it contains a "{" or "}", I was able to remove the duplicate lines from a large text file. That was a huge step, from a 60 MB text file down to 3 MB.

But now I want to remove duplicated words like these:

  @INBOOK{Miller1992,
  author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
    R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland
    S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and
    Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
    Miller, Rowland S. und Mark R. Leary},
  year = {1992},
  editor = {Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun
    A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A.
    van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van
    Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and
    Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk},
  title = {Handbook of discourse analysis (Bd. 3/4)},

The result should look like this:

  @INBOOK{Miller1992,
  author = {Miller,  Rowland S. und Mark R. Leary},
  year = {1992},
  editor = {Teun A. van Dijk},
  title = {Handbook of discourse analysis (Bd. 3/4)},

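The text file has 70,000 lines, and the same author names can appear in several entries. So only the duplicates between curly braces (which may span multiple lines) should be removed: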

  author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
  R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland
  S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and
  Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark
  Miller, Rowland S. und Mark R. Leary},

I tried to modify my Python script for removing duplicate lines so that it removes duplicate words between curly braces, but I'm stuck:

words_seen = set()  # holds words already seen
outfile = open("literatur_clean.txt", "w")
for line in open("literatur_dupl.txt", "r"):
    if '{' in line or '}' in line:
        # some code to check whether the words are duplicate
        pass  # placeholder: this is the part I could not work out
outfile.close()

1 answer:

Answer 0 (score: 1)

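Judging by your current data set, this doesn't look like a problem of duplicated words, but rather that an author or editor is sometimes repeated n times.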

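You could try splitting the string on "and". Then check whether the resulting items are all the same (for example, by putting all the strings into a set, or using them as keys in a dictionary). If the length of the set equals 1, you have removed all duplicates. If not, "and" may also be part of an author or editor name, and you will have to merge two of the items back together.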
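As a rough illustration of the splitting idea, here is a minimal sketch. The names `dedupe_names`, `clean_entry_text` and `FIELD` are made up for this example, and it assumes that names are separated by the literal word "and", that only the `author` and `editor` fields need cleaning, and that field values contain no nested braces. It keeps each distinct name once, in first-seen order, instead of only checking whether the set has length 1.

```python
import re

def dedupe_names(value, sep=" and "):
    """Return the field value with repeated names removed, first-seen order kept."""
    # Join wrapped lines: collapse every run of whitespace into a single space.
    flat = re.sub(r"\s+", " ", value).strip()
    seen = set()
    unique = []
    for name in flat.split(sep):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            unique.append(name)
    return sep.join(unique)

# Assumption: values look like `author = {...}` with no nested braces inside.
FIELD = re.compile(r"\b(author|editor)(\s*=\s*)\{([^{}]*)\}")

def clean_entry_text(text):
    """Deduplicate names inside every author/editor field of the given text."""
    def repl(m):
        return m.group(1) + m.group(2) + "{" + dedupe_names(m.group(3)) + "}"
    return FIELD.sub(repl, text)

sample = """  editor = {Teun A. van Dijk and Teun A. van Dijk and Teun
    A. van Dijk and Teun A. van Dijk},
  title = {Handbook of discourse analysis (Bd. 3/4)},
"""
print(clean_entry_text(sample))
```

Because the pattern can match across line breaks, the whole file would have to be read into one string first (about 3 MB here) rather than processed line by line. Truncated fragments such as a name cut off mid-way would survive as separate items and would still need manual merging, as noted above.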

If that doesn't work (for example, because your data set is not as tidy as suggested), you can find the repeats by looking for sub-matches:

Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary 
^                                        ^
1                                        2

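Advance a pointer through the text string, starting just after the beginning of the string. For each position, find the longest sub-match with the beginning of the string. Save these sub-matches.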
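The pointer scan could look roughly like the sketch below. The function name `prefix_matches` is hypothetical, and using the position with the longest match as the start of the second copy is an assumption about how the saved sub-matches would be used, not something the answer spells out. The scan is a naive character-by-character comparison, so it is quadratic in the worst case, which should be acceptable for single field values.

```python
def prefix_matches(text):
    """Map each position i > 0 to the length of the longest run starting at i
    that equals the beginning of text (positions with no match are omitted)."""
    matches = {}
    for i in range(1, len(text)):
        n = 0
        while i + n < len(text) and text[n] == text[i + n]:
            n += 1
        if n:
            matches[i] = n
    return matches

name = "Miller, Rowland S. und Mark R. Leary"
s = name + " and " + name

matches = prefix_matches(s)
# Assumption: the position with the longest sub-match is where the repeat begins.
start_of_repeat = max(matches, key=matches.get)
print(repr(s[:start_of_repeat]))  # first copy, including the trailing separator
```

In this example the longest sub-match starts at pointer 2 in the diagram above, so everything before that position is one copy of the name plus the separator.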