Question

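The project I'm currently working on has me read a file and then analyze the data inside it. Using FileReader, I have read each line of the file into an array. The file looks like this: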

01 02 03 04 05 06
02 03 04 05 06 07
03 04 05 06 07 08
04 05 06 07 08 09

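These are not the exact numbers, but they make a good example. I'm now trying to find out how many times, say, the number "04" appears in my data. My idea was to split each line and put all the data into a two-dimensional array, but I'm not sure how to do that. Do I need a parser, or can I use some kind of String function (such as split) to break the data apart and then store it in an array?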

Answer 1

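If you only need to count 04, you really don't need to store the whole file. For example, you could read each line and check it for 04 (and add to a counter, or whatever). You could even read it character by character, but that is probably a bit tedious for the slight (if any) efficiency gain.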

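If the processing you need to do on the file is more complex, this approach may not be enough. But unless you specify what that processing is, I can't say whether it is or not.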
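
As a rough illustration of the "read each line and count" idea, here is a minimal sketch. It assumes the values are separated by whitespace; the class name `CountWhileReading` and the helper `countToken` are made up for this example, and a `StringReader` stands in for the real file (a `FileReader` could be passed instead).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CountWhileReading {
    // Hypothetical helper: reads the source line by line and counts the
    // whitespace-separated tokens equal to target, without storing the file.
    static long countToken(Reader source, String target) {
        long count = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                // trim() avoids an empty first token on indented lines
                for (String token : line.trim().split("\\s+")) {
                    if (token.equals(target)) {
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample data from the question; pass a FileReader for a real file.
        String sample = "01 02 03 04 05 06\n02 03 04 05 06 07\n"
                      + "03 04 05 06 07 08\n04 05 06 07 08 09\n";
        System.out.println(countToken(new StringReader(sample), "04")); // prints 4
    }
}
```

Only one line is held in memory at a time, which is the point of this approach.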

Answer 2

You should use a Map to hold the number of occurrences, like this:

public static void main(String[] args) throws IOException {
    // split on runs of whitespace
    Pattern splitter = Pattern.compile("\\s+");
    try (Stream<String> stream = Files.lines(Paths.get("input.txt"))) {
        // count how often each token occurs across the whole file
        Map<String,Long> result = stream.flatMap(splitter::splitAsStream)
                .collect(Collectors.groupingBy(Function.identity(),
                        Collectors.counting()));
        System.out.println(result);
    }
}

Or load the data and parse it in several stages:

public static void main(String[] args) throws IOException {
    // 1. load the data array
    String[][] data;
    try(Stream<String> stream = Files.lines(Paths.get("numbers.txt"))) {
        data = stream.map(line -> line.split("\\s+")).toArray(String[][]::new);
    }
    System.out.format("Total lines = %d%n", data.length);

    // 2. count the occurrences of each word
    Map<String,Long> countDistinct = Arrays.stream(data).flatMap(Arrays::stream)
            .collect(Collectors.groupingBy(Function.identity(),
                    Collectors.counting()));
    System.out.println("Count of 04 = " + countDistinct.getOrDefault("04", 0L));

    // 3. calculate correlations 
    Map<String,Map<String,Long>> correlations;
    correlations = Arrays.stream(data).flatMap((String[] row) -> {
        Set<String> words = new HashSet<>(Arrays.asList(row));
        return words.stream().map(word -> new AbstractMap.SimpleEntry<>(word, words));
    }).collect(Collectors.toMap(kv -> kv.getKey(),
            kv -> kv.getValue().stream()
                    .collect(Collectors.toMap(Function.identity(), v -> 1L)),
            (map1, map2) -> {
                map2.entrySet().forEach(kv -> map1.merge(kv.getKey(), kv.getValue(), Long::sum));
                return map1;
            }));
    System.out.format("Lines with 04 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("04", 0L));
    System.out.format("Lines with both 04 and 07 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("07", 0L));
}

Edit:

Here is a (possibly) easier-to-read version that does not use streams or the functional approach:

public static void main(String[] args) throws IOException {
    long lineCount = 0;
    Map<String,Long> wordCount = new HashMap<>();
    Map<String,Map<String,Long>> correlations = new HashMap<>();
    try(Stream<String> stream = Files.lines(Paths.get("numbers.txt"))) {
        Iterable<String> lines = stream::iterator;
        for(String line : lines) {
            // fresh set for every line, so words from earlier lines are not carried over
            Set<String> lineWords = new HashSet<>();
            lineCount++;
            for(String word : line.split("\\s+")) {
                lineWords.add(word);
                wordCount.merge(word, 1L, Long::sum);
            }
            for(String wordA : lineWords) {
                Map<String,Long> relate = correlations.computeIfAbsent(wordA,
                        key -> new HashMap<>());
                for(String wordB : lineWords) {
                    relate.merge(wordB, 1L, Long::sum);
                }
            }
        }
    }
    System.out.format("Total lines = %d%n", lineCount);
    System.out.println("Count of 04 = " + wordCount.getOrDefault("04", 0L));
    System.out.format("Lines with 04 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("04", 0L));
    System.out.format("Lines with both 04 and 07 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("07", 0L));
}

Output:

Total lines = 4
Count of 04 = 4
Lines with 04 = 4
Lines with both 04 and 07 = 3

Answer 3

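Edit: I misread; you have already read the file into an array. So you can skip straight to processing each entry in the array for the substring.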

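Assuming you are working with an input text file or something similar, you can read the file line by line and count the number of "04" in each line as you read it. You can use a BufferedReader like this: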

// br is assumed to be a BufferedReader already opened on the input file
String line;
while ((line = br.readLine()) != null) {
    // process each line
}

To count the occurrences of the desired string, you can refer to this other answer:

Occurrences of substring in a string
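
For the case where the lines are already in an array, one possible way to count substring occurrences is a plain `indexOf` loop. This is only a sketch and not the code from the linked answer; the class name `SubstringCount` and the helper `occurrences` are invented for illustration. Note that a substring match would also count "04" inside a longer token such as "104", which does not matter for two-digit values separated by spaces.

```java
public class SubstringCount {
    // Hypothetical helper: counts non-overlapping occurrences of needle in text.
    static int occurrences(String text, String needle) {
        if (needle.isEmpty()) {
            return 0; // an empty needle would otherwise loop forever
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length(); // continue searching after this match
        }
        return count;
    }

    public static void main(String[] args) {
        // The array of lines the question says it has already read.
        String[] lines = {
            "01 02 03 04 05 06",
            "02 03 04 05 06 07",
            "03 04 05 06 07 08",
            "04 05 06 07 08 09"
        };
        int total = 0;
        for (String line : lines) {
            total += occurrences(line, "04");
        }
        System.out.println("Occurrences of 04 = " + total); // prints Occurrences of 04 = 4
    }
}
```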

Answer 4

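Your design thinking is "premature"; for example, the idea of using a 2D array here.

You see, you have to understand the requirements better before you start thinking about design/implementation choices.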

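Example: when you care about measuring how often some number shows up overall, a 2D array won't help you. Instead, you could put all the numbers into one long List<Integer>, and then, for example, use some fancy Java 8 stream operations.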

But if that is just one use case among several, then other ways of managing the data in memory might be more efficient.
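
To make the List<Integer> idea a bit more concrete, here is a minimal sketch under the assumption that every token is a decimal integer. The class `FlatListCount` and its helpers `parseAll` and `frequency` are hypothetical names for this example. Parsing to Integer drops the leading zero, so "04" becomes 4.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatListCount {
    // Hypothetical helper: flattens all lines into one long list of integers.
    static List<Integer> parseAll(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(token -> !token.isEmpty()) // skip blank lines
                .map(Integer::valueOf)             // "04" is parsed as decimal 4
                .collect(Collectors.toList());
    }

    // Counts how many entries of the list equal target.
    static long frequency(List<Integer> numbers, int target) {
        return numbers.stream().filter(n -> n == target).count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "01 02 03 04 05 06",
                "02 03 04 05 06 07",
                "03 04 05 06 07 08",
                "04 05 06 07 08 09");
        List<Integer> numbers = parseAll(lines);
        System.out.println("Total numbers = " + numbers.size());  // prints Total numbers = 24
        System.out.println("Count of 4 = " + frequency(numbers, 4)); // prints Count of 4 = 4
    }
}
```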

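Beyond that: if you find that what you will do with this data goes beyond simple counting, Java might not be the best choice here. You know, languages like R are designed specifically for this: processing large amounts of data, and giving you "instant" access to a wide range of statistical operations.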

To answer your idea about counting the occurrences of all distinct numbers: that is really simple. You use a Map<Integer, Integer> here, as in:

Map<Integer, Integer> numbersWithCount = new HashMap<>();

Now you loop over your data, and for each data point:

int currentNumber = ... // next number from your input data

int counterForNum;
if (numbersWithCount.containsKey(currentNumber)) {
    // seen before: increment the stored counter
    counterForNum = numbersWithCount.get(currentNumber) + 1;
} else {
    // currentNumber found the first time
    counterForNum = 1;
}
numbersWithCount.put(currentNumber, counterForNum);

In other words: you simply iterate over all incoming numbers, and for each one you either create a new counter or increment the one already stored.

If you use a TreeMap instead of a HashMap, you even get your keys sorted. There are many possibilities ...
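
As a small sketch of the TreeMap variant: the class `SortedCounts` and the helper `countAll` are made-up names, the input numbers are arbitrary sample values, and `Map.merge` is used here as a shorter alternative to the containsKey/put pattern above.

```java
import java.util.Map;
import java.util.TreeMap;

public class SortedCounts {
    // Hypothetical helper: counts each number; TreeMap keeps the keys sorted.
    static Map<Integer, Integer> countAll(int[] numbers) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int n : numbers) {
            // stores 1 for a new key, otherwise adds 1 to the stored counter
            counts.merge(n, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] numbers = {4, 2, 4, 9, 2, 4}; // arbitrary sample input
        System.out.println(countAll(numbers)); // prints {2=2, 4=3, 9=1}
    }
}
```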
