Counting the occurrences of each unique string in a document

Time: 2014-12-05 22:01:34

Tags: java arraylist

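I am reading a log file into Java. For each line of the log file, I check whether the line contains an IP address. If it does, I want to add 1 to the number of times that IP address has appeared in the log file. How can I achieve this in Java?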

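The code below successfully extracts the IP address from every line that contains one, but the part that counts how often each IP address occurs does not work.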

void read(String fileName) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
    int counter = 0;
    ArrayList<IPHolder> ips = new ArrayList<IPHolder>();
    try {
        String line;
        while ((line = br.readLine()) != null) {
            if(!getIP(line).equals("0.0.0.0")){
                if(ips.size()==0){
                    IPHolder newIP = new IPHolder();
                    newIP.setIp(getIP(line));
                    newIP.setCount(0);
                    ips.add(newIP);
                }
                for(int j=0;j<ips.size();j++){
                    if(ips.get(j).getIp().equals(getIP(line))){
                        ips.get(j).setCount(ips.get(j).getCount()+1);
                    }else{
                        IPHolder newIP = new IPHolder();
                        newIP.setIp(getIP(line));
                        newIP.setCount(0);
                        ips.add(newIP);
                    }
                }
                if(counter % 1000 == 0){System.out.println(counter+", "+ips.size());}
                counter+=1;
            }
        }
    } finally {br.close();}
    for(int k=0;k<ips.size();k++){
        System.out.println("ip, count: "+ips.get(k).getIp()+" , "+ips.get(k).getCount());
    }
}

public String getIP(String ipString){//extracts an ip from a string if the string contains an ip
    String IPADDRESS_PATTERN = 
    "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";

    Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
    Matcher matcher = pattern.matcher(ipString);
    if (matcher.find()) {
        return matcher.group();
    }
    else{
        return "0.0.0.0";
    }
}

The holder class is:

public class IPHolder {

    private String ip;
    private int count;

    public String getIp(){return ip;}
    public void setIp(String i){ip=i;}

    public int getCount(){return count;}
    public void setCount(int ct){count=ct;}
}

2 answers:

Answer 0 (score: 1)

The keyword to search for in this case is HashMap. A HashMap is a collection of key-value pairs (in this case, pairs of IPs and their counts).

"192.168.1.12" - 12
"192.168.1.13" - 17
"192.168.1.14" - 9

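And so on. It is easier to use and access than iterating over an array of holder objects every time to find out whether a holder already exists for that IP.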

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(/*Your file */)));

HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

String line = null;

while( (line = br.readLine()) != null) {

    // Iterate over lines and search for ip address patterns
    String[] addressesFoundInLine = ...;


    for(String ip: addressesFoundInLine ) {

        // Did you already have that address in your file earlier? If yes, increase its counter by 1
        if(occurrences.containsKey(ip))
            occurrences.put(ip, occurrences.get(ip)+1);

        // If not, create a new entry for this address
        else
            occurrences.put(ip, 1);
    } 
}


// TreeMaps are automatically ordered by key if their keys implement 'Comparable', which is the case for strings and integers
TreeMap<Integer, ArrayList<String>> turnedAround = new TreeMap<Integer, ArrayList<String>>();

Set<Entry<String, Integer>> es = occurrences.entrySet();

// Switch keys and values of HashMap and create a new TreeMap (in case there are two ips with the same count, add them to a list)
for(Entry<String, Integer> en: es) {

    if(turnedAround.containsKey(en.getValue()))         
        turnedAround.get(en.getValue()).add((String) en.getKey());
    else {
        ArrayList<String> ips = new ArrayList<String>();
        ips.add(en.getKey());
        turnedAround.put(en.getValue(), ips);
    }

}

// Print out the values (IPs with the same count are printed in no particular order; ordering them would require another sorting step)
for(Entry<Integer, ArrayList<String>> entry: turnedAround.entrySet()) {         
    for(String s: entry.getValue())
        System.out.println(s + " - " + entry.getKey());         
}

In my case, the output looks like this:

192.168.1.19 - 4
192.168.1.18 - 7
192.168.1.27 - 19
192.168.1.13 - 19
192.168.1.12 - 28

I answered this question about half an hour ago, and I guess it is exactly what you are looking for, so take a look if you need some example code.
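
The counting loop above leaves the extraction step elided (`String[] addressesFoundInLine = ...;`). As one possible way to approach that step, the sketch below reuses the regular expression from the question and counts with `Map.merge`, which assumes Java 8 or later. The class name `IpCounter` and the helpers `findAddresses` and `countAddresses` are hypothetical names chosen for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IpCounter {

    // Same IPv4 pattern as in the question, compiled once and reused.
    private static final Pattern IP_PATTERN = Pattern.compile(
        "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}"
        + "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)");

    // Hypothetical helper: returns every IPv4-looking substring found in one line.
    static List<String> findAddresses(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = IP_PATTERN.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // Hypothetical helper: counts how often each address occurs across all lines.
    static Map<String, Integer> countAddresses(Iterable<String> lines) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String line : lines) {
            for (String ip : findAddresses(line)) {
                // merge() stores 1 for a new key, otherwise adds 1 to the existing value.
                occurrences.merge(ip, 1, Integer::sum);
            }
        }
        return occurrences;
    }
}
```

Because each address is looked up by key, there is no need to loop over previously seen addresses, which is where the list-based version in the question goes wrong.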

Answer 1 (score: 0)

Here is some code that uses a HashMap to store the IPs and a regular expression to match them in each line. It uses try-with-resources to close the file automatically.

Edit: I added code to print in descending order, as you mentioned on the other answer.

// Requires: import java.io.*; import java.util.*; import java.util.regex.*;
void read(String fileName) throws IOException {
    //Step 1 find and register IPs and store their occurrence counts
    HashMap<String, Integer> ipAddressCounts = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)))) {
        // The dot is escaped so that it matches a literal '.' rather than any character
        Pattern findIPAddrPattern = Pattern.compile("((\\d+\\.){3}\\d+)");
        String line;
        while ((line = br.readLine()) != null) {
            Matcher matcher = findIPAddrPattern.matcher(line);
            while (matcher.find()) {
                String ipAddr = matcher.group(0);
                if ( ipAddressCounts.get(ipAddr) == null ) {
                    ipAddressCounts.put(ipAddr, 1);
                }
                else {
                    ipAddressCounts.put(ipAddr, ipAddressCounts.get(ipAddr) + 1);
                }
            }
        }
    }

    //Step 2 reverse the map to store IPs by their frequency
    HashMap<Integer, HashSet<String>> countToAddrs = new HashMap<>();
    for (Map.Entry<String, Integer> entry : ipAddressCounts.entrySet()) {
        Integer count = entry.getValue();
        if ( countToAddrs.get(count) == null )
            countToAddrs.put(count, new HashSet<String>());
        countToAddrs.get(count).add(entry.getKey());
    }

    //Step 3 sort and print the ip addresses, most frequent first
    ArrayList<Integer> allCounts = new ArrayList<>(countToAddrs.keySet());
    Collections.sort(allCounts, Collections.reverseOrder());
    for (Integer count : allCounts) {
        for (String ip : countToAddrs.get(count)) {
            System.out.println("ip, count: " + ip + " , " + count);
        }
    }
}
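
As an alternative to reversing the map in Step 2, the entries can be copied into a list and sorted directly by count. The following is a minimal sketch of that idea, assuming Java 8 or later. The class name `IpCountSorter` and the method `sortByCountDescending` are hypothetical, and breaking ties by the IP string is an added assumption so that the output order is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class IpCountSorter {

    // Returns the map's entries ordered by count, highest first.
    // Entries with equal counts are ordered by their IP string (an assumption
    // made here only to get a stable order).
    static List<Map.Entry<String, Integer>> sortByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Compare b to a so that larger counts come first.
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }
}
```

The returned list could then be printed with the same `"ip, count: "` format as in Step 3, iterating over the entries and reading `getKey()` and `getValue()` from each.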