Question

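OK, so I have roughly 1,000 duplicate phrases in this file, so doing this by hand is not an option. Note that these are PHRASES, not lines or words; each "phrase" is about 10 lines long.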

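I am trying to get rid of the duplicate phrases, but the only thing that marks an "item" (or phrase) as a duplicate is its position entry. For example: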

    class Item0
    {
        position[]={4347.6001,0,3214.6399};
        azimut=128.81599;
        special="NONE";
        id=1;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4347.6, 3214.64, 0]; this setDir 128.816;";
    };
    class Item1
    {
        position[]={4347.6001,0,3214.6399};
        azimut=128.81599;
        special="NONE";
        id=2;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4347.6, 3214.64, 0]; this setDir 128.816;";
    };

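The two phrases above are duplicates, but their id and Item# differ, so the only way to identify duplicate phrases is by the position[]={} parameter. When two phrases have the same position, they are duplicates, regardless of id or Item#.

So my goal is to use some kind of code, script, program, or regular expression to remove all duplicate phrases while leaving the first copy intact. If there are three copies of a phrase, one is kept and the other two are removed. How can I do this?

Example of the desired input/output:

Input: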

    class Item0
    {
        position[]={4347.6001,0,3214.6399};
        azimut=128.81599;
        special="NONE";
        id=1;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4347.6, 3214.64, 0]; this setDir 128.816;";
    };
    class Item1
    {
        position[]={4682.6001,0,3847.6399};
        azimut=128.81599;
        special="NONE";
        id=2;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4682.6, 3847.64, 0]; this setDir 128.816;";
    };
    class Item2
    {
        position[]={4347.6001,0,3214.6399};
        azimut=128.81599;
        special="NONE";
        id=3;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4347.6, 3214.64, 0]; this setDir 128.816;";
    };

Output:

    class Item0
    {
        position[]={4347.6001,0,3214.6399};
        azimut=128.81599;
        special="NONE";
        id=1;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4347.6, 3214.64, 0]; this setDir 128.816;";
    };
    class Item1
    {
        position[]={4682.6001,0,3847.6399};
        azimut=128.81599;
        special="NONE";
        id=2;
        side="EMPTY";
        vehicle="Land_fortified_nest_small";
        lock="UNLOCKED";
        skill=0.2;
        init="this setPos [4682.6, 3847.64, 0]; this setDir 128.816;";
    };

Answer 1

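My initial approach would be:

Create an array to store the unique positions
Parse the file; if the position is already in the array, skip the phrase. Otherwise, write the phrase to the output file and store its position in the array.
Loop until EOF

This gets you what you want, but it is not the optimal solution. Think about how you store an item the first time you encounter it and how you check for it later (scanning an array can take a while).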
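The steps above could be sketched in Java roughly as follows. This is only an illustration under a few assumptions: the class name `PositionDedup` and the method `dedupe` are made up for this example, every phrase ends with a line containing only `};`, and a `HashSet` replaces the plain array so that checking for a known position does not require a scan.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
final class PositionDedup {
    /** Returns the lines with every later phrase that repeats a position removed. */
    static List<String> dedupe(List<String> lines) {
        Set<String> seenPositions = new HashSet<>(); // constant-time lookup on average
        List<String> output = new ArrayList<>();
        List<String> block = new ArrayList<>();      // lines of the phrase being read
        String position = null;                      // position line of that phrase

        for (String line : lines) {
            String trimmed = line.trim();
            block.add(line);
            if (trimmed.startsWith("position[]=")) {
                position = trimmed;
            }
            // Assumption: a line holding only "};" closes the phrase.
            if (trimmed.equals("};")) {
                // add() returns false when this position was stored earlier
                if (position == null || seenPositions.add(position)) {
                    output.addAll(block);
                }
                block.clear();
                position = null;
            }
        }
        output.addAll(block); // keep trailing lines that belong to no phrase
        return output;
    }
}
```

To run it over a file, one option is to pass the result of `Files.readAllLines` to `dedupe` and write the returned list back with `Files.write`.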

Answer 2

If the items are objects of a class, you could consider using a Set and adding the items to it.

      Set<Item> itemSet = new HashSet<Item>();
      itemSet.add(new Item());

Once all the items have been added, only the unique ones remain, provided Item overrides equals() and hashCode().

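You can leave the ID out of the comparison and find the duplicated IDs by checking whether each item was actually inserted. Since the IDs are sequential, this will work. To leave the id out, use a new class with the same data members minus the id.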
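A class that leaves the id out could look something like the sketch below. The name `ItemKey` and the choice of fields are assumptions made for illustration; the question says the position alone decides whether two items are duplicates, so a key holding only the position would work just as well.

```java
import java.util.Objects;

// Hypothetical key type: the fields that decide whether two items are duplicates.
final class ItemKey {
    final String position; // e.g. "4347.6001,0,3214.6399"
    final String vehicle;  // e.g. "Land_fortified_nest_small"

    ItemKey(String position, String vehicle) {
        this.position = position;
        this.vehicle = vehicle;
    }

    // equals() and hashCode() never look at an id, so two items that
    // differ only by id produce equal keys.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ItemKey)) return false;
        ItemKey that = (ItemKey) other;
        return position.equals(that.position) && vehicle.equals(that.vehicle);
    }

    @Override
    public int hashCode() {
        return Objects.hash(position, vehicle);
    }
}
```

With such a key, `Set.add` returns false for the second and later copies, which is the signal to record or drop that item's id.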

I used a different example (one that is easy to construct); I hope it helps.

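    // requires java.util.ArrayList, java.util.HashSet and java.util.Set
    int[] item = {7, 3, 7, 5, 3};  // sample data; each value stands for one item's content
    int counter = 0;
    ArrayList<Integer> duplicateIds = new ArrayList<Integer>();
    Set<Integer> afterDups = new HashSet<Integer>();
    for (int i : item) {
        counter++;
        // you can create a new class excluding the id and initialize it here
        // add() returns false when the value is already in the set
        if (!afterDups.add(i))
            duplicateIds.add(counter);
    }
    // duplicateIds now holds the 1-based ids of the repeats: [3, 5]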

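Edit:

OK, I missed that the data is read from a file, so I am adding this to the answer. You can check each line, and given that your file is in this format, you do not want to compare the `class Item0` and `id=1;` lines. For the rest, you can read the file line by line and collect the lines in one string. Once a class is complete (a class starts with a `class` line and ends with a `};` line), you can start another string for the next one. This separates the data from the credentials (id and class). Using the separator, you can later split the strings again and recreate the file.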

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class RemoveDuplicates {
        public static void main(String[] args) {
            String separator = "$$";
            // this contains the $$-separated data lines of the current class
            String currentClassText = "";
            // this contains the $$-separated class name, id, and closing brace
            String currentClassCredentials = "";
            // LinkedHashSet keeps insertion order, so texts and credentials stay aligned
            Set<String> texts = new LinkedHashSet<String>();
            List<String> credentials = new ArrayList<String>();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
                String strLine;
                while ((strLine = br.readLine()) != null) {
                    String trimmed = strLine.trim();
                    // match whole lines: the position line also contains "};"
                    // Strings are immutable, so concat() alone would discard the result; use +=
                    if (trimmed.startsWith("id=") || trimmed.startsWith("class") || trimmed.equals("};"))
                        currentClassCredentials += strLine + separator;
                    else
                        currentClassText += strLine + separator;

                    // check if the class has completed
                    if (trimmed.equals("};")) {
                        // add() returns true only when the text is not a duplicate
                        if (texts.add(currentClassText))
                            credentials.add(currentClassCredentials);
                        // set everything back to empty for the next round
                        currentClassCredentials = currentClassText = "";
                    }
                }
            } catch (IOException e) {
                System.err.println("Error: " + e.getMessage());
            }
            System.out.println(texts.size() + " unique classes kept");
        }
    }

Answer 3

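I would generate a hash of each phrase and store it in a map. Keep adding new phrases and ignore any that already exist. Keys in a map are always unique, so you will not end up with duplicates.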
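One way to sketch this idea in Java is to let a `LinkedHashMap` do the hashing: it hashes each key internally and compares keys with `equals`, so two different keys that happen to share a hash code are still kept apart. The class name `PhraseMap`, the method `dedupe`, and both regular expressions are assumptions made for this example. The sketch keys each phrase on the contents of its `position[]={...}` entry, assumes phrases are not nested, and returns only the phrases, dropping any text between them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class PhraseMap {
    // A phrase runs from a "class ItemN" line to a line holding only "};".
    private static final Pattern PHRASE = Pattern.compile(
            "^[ \\t]*class\\s+Item\\d+.*?^[ \\t]*\\};[ \\t]*$\\R?",
            Pattern.DOTALL | Pattern.MULTILINE);
    private static final Pattern POSITION =
            Pattern.compile("position\\[\\]=\\{([^}]*)\\}");

    /** Returns the phrases of the text, keeping the first one for each position. */
    static String dedupe(String text) {
        // LinkedHashMap preserves the order in which keys were first inserted.
        Map<String, String> firstByPosition = new LinkedHashMap<>();
        Matcher phrases = PHRASE.matcher(text);
        while (phrases.find()) {
            String phrase = phrases.group();
            Matcher pos = POSITION.matcher(phrase);
            // Fall back to the whole phrase when it has no position entry.
            String key = pos.find() ? pos.group(1) : phrase;
            // putIfAbsent ignores the phrase when the key already exists.
            firstByPosition.putIfAbsent(key, phrase);
        }
        return String.join("", firstByPosition.values());
    }
}
```

Keying on the position text itself, rather than on a separately computed hash value, avoids having to reason about hash collisions at all.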

How do I remove duplicate phrases in a text file?
