How to remove duplicates from a CSV file in Java

Asked: 2017-11-22 01:38:54

Tags: java csv duplicates

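I am trying to remove the rows with duplicate tweet tokens (column [5]) from the CSV file created by eventDetectionName(). After running EventDetectioncopy.java, some of the duplicate tweet tokens are removed, but others remain.

A tweet token counts as a duplicate when another row with the same cluster ID has exactly the same tweet token string.

Here is the code: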

import java.io.*;
import java.util.*;

public class EventDetectioncopy {
    public static void main(String[] args) throws FileNotFoundException, IOException{
        System.out.print("Enter a name for new Tweet Cluster sorting by name entity: ");
        BufferedReader scanName = new BufferedReader(new InputStreamReader(System.in));
        String newNamefile = scanName.readLine();

        System.out.print("Enter a name for new Tweet Cluster sorting by tweet tokens: ");
        BufferedReader scanToken = new BufferedReader(new InputStreamReader(System.in));
        String newTokenfile = scanToken.readLine();
        try{

            eventDetectionName(newNamefile);
            eventDetectionToken(newNamefile, newTokenfile);

        }
        catch (FileNotFoundException e) {
            System.out.println(e);
        }
        catch (IOException e){}
    }

    //remove duplicate tweet token
    public static void eventDetectionToken(String fileInput, String fileOutput) throws FileNotFoundException, IOException{
        FileWriter newCsv = new FileWriter(fileOutput + ".csv");
        BufferedWriter newCsvBW = new BufferedWriter(newCsv);
        BufferedReader reader = new BufferedReader(new FileReader(fileInput + ".csv"));
        String data;
        try{
            String temp = null;
            List<String> tempList = new ArrayList<String>();
            do
            {
                data = reader.readLine();
                String tweetToken = null;

                if(data != null)
                {
                    String[] splitText = data.split(",");
                    tweetToken = splitText[5];
                }

                if(temp != null)
                {
                    if (data == null || tweetToken.contains(tweetToken))
                    {
                        if (!(temp.equals(tweetToken)))
                        {
                            for (int i = 0; i < tempList.size(); i++) 
                            {
                                newCsvBW.append(tempList.get(i));
                                newCsvBW.append("\n");
                            }
                        }
                        tempList.clear();
                        temp = tweetToken;
                    }
                }
                else
                {
                    temp = tweetToken;
                }
                tempList.add(data);
            }
            while(data != null);
        }
        finally
        {
            newCsvBW.close();
            reader.close();
        }
    }

    //entity name that occurs more than 10 times
    public static void eventDetectionName(String filename) throws FileNotFoundException, IOException{
        String csv = "1day/clusters.sortedby.clusterid.csv";
        FileWriter newCsv = new FileWriter(filename + ".csv");
        BufferedWriter newCsvBW = new BufferedWriter(newCsv);
        BufferedReader reader = new BufferedReader(new FileReader(csv));
        String data;

        try{
            String temp = null;
            List<String> tempList = new ArrayList<String>();
            do 
            {
                data = reader.readLine();
                String nameEntity = null;
                if (data != null) 
                {
                    String[] splitText = data.split(",");
                    nameEntity = splitText[1];
                }
                if (temp != null) 
                {
                    if (data == null || !(nameEntity.equals(temp))) 
                    {
                        if (tempList.size() >= 10) 
                        {
                            for (int i = 0; i < tempList.size(); i++) 
                            {
                                newCsvBW.append(tempList.get(i));
                                newCsvBW.append("\n");
                            }
                        }
                        tempList.clear();
                        temp = nameEntity;
                    }
                } 
                else 
                {
                    temp = nameEntity;
                }
                tempList.add(data);
            } 
            while (data != null);
        }
        finally
        {
            reader.close();
            newCsvBW.close();
        }

    }
}

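Below is the original content produced by eventDetectionName(), which keeps the name entities that occur more than 10 times. The duplicates have not been processed yet: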

[clusterid], [name entity], [tweetid], [timestamp], [userid], [tweet token], [tweet text]

    7722    lenovo  2.56142E+17 1.3499E+12  236705687   lenovo top hp becom 1 pc maker zdnet lenovo top hp becom 1 pc makerzdnetsummari china le    Lenovo tops HP to become No. 1 PC maker - ZDNet: Lenovo tops HP to become No. 1 PC makerZDNetSummary: China's Le... 
    7722    lenovo  2.56143E+17 1.3499E+12  72541972    lenovo top hp becom 1 pc maker zdnet lenovo top hp becom 1 pc makerzdnetsummari china le    Lenovo tops HP to become No. 1 PC maker - ZDNet: Lenovo tops HP to become No. 1 PC makerZDNetSummary: China's Le...
    7722    lenovo  2.56165E+17 1.34991E+12 112115244   lenovo overtak hp world top pc maker q3 Lenovo Overtakes HP as World’s Top PC Maker in Q3 
    7722    lenovo  2.56165E+17 1.34991E+12 14886375    ahess247 lenovo overtak hp world top pc maker one market survey hpq dell aapl   RT @ahess247 Lenovo Overtakes HP as World's Top PC Maker In One Market Survey $HPQ $DELL $AAPL
    7722    lenovo  2.56167E+17 1.34991E+12 43468679    cna lenovo top hp world biggest pc maker new york chines manufactur lenovo overtaken us base    CNA - Lenovo tops HP as world's biggest PC maker: NEW YORK: Chinese manufacturer Lenovo has overtaken US-based H... 
    7722    lenovo  2.56167E+17 1.34991E+12 231001548   lenovo top hp world biggest pc maker new york chines manufactur lenovo overtaken us base hewlett    Lenovo tops HP as world's biggest PC maker: NEW YORK: Chinese manufacturer Lenovo has overtaken US-based Hewlett... 
    7722    lenovo  2.5617E+17  1.34991E+12 309407203   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.5617E+17  1.34991E+12 865570603   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.5617E+17  1.34991E+12 865474436   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.5617E+17  1.34991E+12 19961203    reddingnewsblog hp lenovo battl top spot pc market computerworld afphp lenovo battl top spot    ReddingNewsBlog HP, Lenovo battle for top spot in PC market - Computerworld: AFPHP, Lenovo battle for top spot i... 
    7722    lenovo  2.56171E+17 1.34991E+12 131477801   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld
    7722    lenovo  2.56171E+17 1.34991E+12 138389154   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld
    7722    lenovo  2.56171E+17 1.34991E+12 297753408   hp lenovo battl top spot pc market computerworld afphp lenovo battl top spot pc marketcompu HP, Lenovo battle for top spot in PC market - Computerworld: AFPHP, Lenovo battle for top spot in PC marketCompu... 
    7722    lenovo  2.56174E+17 1.34991E+12 558600336   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.56174E+17 1.34991E+12 367209383   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.56174E+17 1.34991E+12 755374159   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.56174E+17 1.34991E+12 36024932    hp lenovo battl top spot pc market computerworld wall street journalhp lenovo battl top spot    HP, Lenovo battle for top spot in PC market - Computerworld: Wall Street JournalHP, Lenovo battle for top spot i... 
    7722    lenovo  2.56176E+17 1.34991E+12 18437660    lenovo pass hp top pc maker ft  Lenovo passes HP to be top PC maker: #FT
    7722    lenovo  2.56176E+17 1.34991E+12 543944864   hp lenovo battl top spot pc market computerworld googlenew  HP, Lenovo battle for top spot in PC market - Computerworld #googlenews
    7722    lenovo  2.56179E+17 1.34991E+12 113671593   lenovo pass hp top pc maker Lenovo passes HP to be top PC maker

Below is the output after running eventDetectionToken(). It should have removed the duplicates, but it removed only some of them:

[clusterid], [name entity], [tweetid], [timestamp], [userid], [tweet token], [tweet text]

    7722    lenovo  2.56143E+17 1.3499E+12  72541972    lenovo top hp becom 1 pc maker zdnet lenovo top hp becom 1 pc makerzdnetsummari china le    Lenovo tops HP to become No. 1 PC maker - ZDNet: Lenovo tops HP to become No. 1 PC makerZDNetSummary: China's Le...
    7722    lenovo  2.56165E+17 1.34991E+12 112115244   lenovo overtak hp world top pc maker q3 Lenovo Overtakes HP as World’s Top PC Maker in Q3
    7722    lenovo  2.56165E+17 1.34991E+12 14886375    ahess247 lenovo overtak hp world top pc maker one market survey hpq dell aapl   RT @ahess247 Lenovo Overtakes HP as World's Top PC Maker In One Market Survey $HPQ $DELL $AAPL
    7722    lenovo  2.56167E+17 1.34991E+12 43468679    cna lenovo top hp world biggest pc maker new york chines manufactur lenovo overtaken us base    CNA - Lenovo tops HP as world's biggest PC maker: NEW YORK: Chinese manufacturer Lenovo has overtaken US-based H...
    7722    lenovo  2.56167E+17 1.34991E+12 231001548   lenovo top hp world biggest pc maker new york chines manufactur lenovo overtaken us base hewlett    Lenovo tops HP as world's biggest PC maker: NEW YORK: Chinese manufacturer Lenovo has overtaken US-based Hewlett...
    7722    lenovo  2.5617E+17  1.34991E+12 865474436   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.5617E+17  1.34991E+12 19961203    reddingnewsblog hp lenovo battl top spot pc market computerworld afphp lenovo battl top spot    ReddingNewsBlog HP, Lenovo battle for top spot in PC market - Computerworld: AFPHP, Lenovo battle for top spot i... 
    7722    lenovo  2.56171E+17 1.34991E+12 138389154   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.56171E+17 1.34991E+12 297753408   hp lenovo battl top spot pc market computerworld afphp lenovo battl top spot pc marketcompu HP, Lenovo battle for top spot in PC market - Computerworld: AFPHP, Lenovo battle for top spot in PC marketCompu... 
    7722    lenovo  2.56174E+17 1.34991E+12 755374159   hp lenovo battl top spot pc market computerworld    HP, Lenovo battle for top spot in PC market - Computerworld 
    7722    lenovo  2.56174E+17 1.34991E+12 36024932    hp lenovo battl top spot pc market computerworld wall street journalhp lenovo battl top spot    HP, Lenovo battle for top spot in PC market - Computerworld: Wall Street JournalHP, Lenovo battle for top spot i... 
    7722    lenovo  2.56176E+17 1.34991E+12 18437660    lenovo pass hp top pc maker ft  Lenovo passes HP to be top PC maker: #FT
    7722    lenovo  2.56176E+17 1.34991E+12 543944864   hp lenovo battl top spot pc market computerworld googlenew  HP, Lenovo battle for top spot in PC market - Computerworld #googlenews
    7722    lenovo  2.56179E+17 1.34991E+12 113671593   lenovo pass hp top pc maker Lenovo passes HP to be top PC maker

The tweet token (column [5]) that is still duplicated in the output is: hp lenovo battl top spot pc market computerworld

How can I remove the remaining duplicates?

1 answer:

Answer 0 (score: 2)

You can solve this easily with univocity-parsers. It will also parse the data faster than the hand-written code above.

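import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// create a CSV parser
CsvParserSettings settings = new CsvParserSettings(); // configure the parser as required
CsvParser parser = new CsvParser(settings);

Set<String> tweets = new HashSet<>(); // tweet tokens seen so far
for (String[] row : parser.iterate(new File("/path/to/input.csv"))) {
    if (!tweets.add(row[5])) {
        continue; // add() returned false: duplicate token, skip the row
    }
    System.out.println(Arrays.toString(row)); // process the row
}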
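
If adding a dependency is not an option, the same idea can be sketched with the standard library alone. The loop in the question only ever compares a row's token with the token of the row directly before it, so it collapses adjacent repeats and lets non-adjacent ones through. Remembering every key seen so far avoids that. The sketch below also scopes the check to the cluster ID, since the question defines a duplicate as the same token within the same cluster. The class name `TweetDeduplicator`, the method `removeDuplicateTokens`, and the sample rows are made up for illustration, and the plain `split` assumes that no quoted field containing the delimiter appears before column 5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TweetDeduplicator {

    // Column positions taken from the header shown in the question.
    static final int CLUSTER_ID_COLUMN = 0;
    static final int TWEET_TOKEN_COLUMN = 5;

    /**
     * Keeps the first row for each (cluster ID, tweet token) pair and
     * preserves the input order. Hypothetical helper, not part of any library.
     */
    static List<String> removeDuplicateTokens(List<String> rows, String delimiterRegex) {
        Set<List<String>> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String row : rows) {
            String[] columns = row.split(delimiterRegex, -1);
            if (columns.length <= TWEET_TOKEN_COLUMN) {
                // Assumption: rows that are too short are passed through unchanged.
                unique.add(row);
                continue;
            }
            List<String> key = Arrays.asList(
                    columns[CLUSTER_ID_COLUMN].trim(),
                    columns[TWEET_TOKEN_COLUMN].trim());
            // Set.add returns false when the key was already present.
            if (seen.add(key)) {
                unique.add(row);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Invented sample rows that follow the column layout in the question.
        List<String> rows = Arrays.asList(
                "7722,lenovo,1,10,100,hp lenovo battl top spot,HP Lenovo battle",
                "7722,lenovo,2,11,101,lenovo pass hp top pc maker,Lenovo passes HP",
                "7722,lenovo,3,12,102,hp lenovo battl top spot,HP Lenovo battle",
                "7723,lenovo,4,13,103,hp lenovo battl top spot,HP Lenovo battle");
        for (String row : removeDuplicateTokens(rows, ",")) {
            System.out.println(row);
        }
    }
}
```

With the sample rows in `main`, the third row is dropped because its token already appeared in cluster 7722, while the fourth row is kept because it belongs to a different cluster. Keeping the first occurrence rather than the last is a choice made for this sketch. To run it against a real file, the lines could be read with `Files.readAllLines` and the delimiter set to whatever the file actually uses, for example `"\t"` for tab-separated data.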

Hope this helps.

Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).