Question

Suppose I have a text file that looks like this (including the "filename    matches" header line):

filename    matches
bugs.txt    5
bugs.txt    3
bugs.txt    12
fish.txt    4
fish.txt    67
birds.txt    34

and so on...

I want to create a new text file in which each line represents one filename and contains the following information: filename, number of times filename appears, sum of matches

So the first three lines would be:

bugs.txt    3    20
fish.txt    2    71
birds.txt   1    34

The first line of the original text file (the one containing the text filename \t matches) is giving me trouble. Any suggestions?

Here is my code, which doesn't quite solve the problem (off-by-one errors...):

h = null
instances = 0
matches = 0

f.eachLine { line ->

String[] data = line.split (/\t/)

if (line =~ /filename.*/) {}

else {
    source = data[0]  

    if ( source == h) {
        instances ++
        matches = matches + data[9]
    }
    else {
        println h + '\t' + instances + '\t' + matches
        instances = 0   
        matches = 0
        h = source
    }    
} 
}

Note: the data[] indices correspond to the actual text file I'm working with

Answer 1

I came up with this (using dummy data)

// In reality, you can get this with:
// def text = new File( 'file.txt' ).text
def text = '''filename\tmatches
             |bugs.txt\t5
             |bugs.txt\t3
             |bugs.txt\t12
             |fish.txt\t4
             |fish.txt\t67
             |birds.txt\t34'''.stripMargin()

text.split( /\r\n|\n|\r/ ).                                     // split on any newline style, trying \r\n first
     drop(1)*.                                                  // drop the header line
     split( /\t/ ).                                             // then split each of these by tab
     collect { [ it[ 0 ], it[ 1 ] as int ] }.                   // convert the second element to int
     groupBy { it[ 0 ] }.                                       // group into a map by filename
     collect { k, v -> [ k, v.size(), v*.getAt( 1 ).sum() ] }*. // then make a list of file,nfiles,sum
     join( '\t' ).                                              // join each of these into a string separated by tab
     each {                                                     // then print them out
       println it
     }

Obviously, this loads the entire file into memory at once...

Answer 2

The main problems with your code are:

You are using data[9] when the matches are in column 1
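The source == h check skips the first row of each file: when the filename changes, the counters are reset to 0 and that row is never counted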
Because you only println when the filename changes, the result for the last file is never output
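
For reference, a minimal sketch of the original loop with those three points addressed. It assumes the filename is in column 0 and the match count in column 1, as in the sample data (the indices would need adjusting for the real file), and that the header line starts with "filename". The summarize name and the inline sample lines are made up for illustration.

```groovy
// Hypothetical helper: summarises consecutive runs of the same filename.
// Assumes tab-separated lines, filename in column 0, match count in column 1.
def summarize(List<String> lines) {
    def out = []
    String current = null
    int instances = 0
    int matches = 0

    lines.each { line ->
        if (line.startsWith('filename')) return      // skip the header line
        def data = line.split(/\t/)
        if (data[0] != current) {
            // filename changed: emit the finished group (if any), then start a new one
            if (current != null) out << "$current\t$instances\t$matches".toString()
            current = data[0]
            instances = 0
            matches = 0
        }
        // count every row, including the first row of each group
        instances++
        matches += data[1] as int
    }
    // flush the last group, which no filename change would otherwise trigger
    if (current != null) out << "$current\t$instances\t$matches".toString()
    out
}

def sample = ['filename\tmatches', 'bugs.txt\t5', 'bugs.txt\t3', 'bugs.txt\t12',
              'fish.txt\t4', 'fish.txt\t67', 'birds.txt\t34']
summarize(sample).each { println it }
```

This keeps only one group in memory at a time, but it relies on all rows for a given filename being adjacent in the input.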

Here is a simpler implementation that accumulates the results in a map:

// this will store a map of filename -> list of matches
// e.g. ['bugs.txt': [5, 3, 12], ...]
def fileMatches = [:].withDefault{[]}

new File('file.txt').eachLine { line ->
    // skip the header line
    if (!(line =~ /filename.*/)) {
        def (source, matches) = line.split (/\t/)
        // append the number of matches to source's list
        fileMatches[source] << (matches as int)
    }
}
fileMatches.each { source, matches ->
    println "$source\t${matches.size()}\t${matches.sum()}"
}
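
Since the question asks for a new text file rather than console output, the same map could be written out with withWriter instead of println. A sketch under the same column assumptions; the writeSummary name and the temporary files used to try it out are placeholders, not part of the original answer.

```groovy
// Hypothetical helper: reads a tab-separated input file and writes
// one "filename<TAB>count<TAB>sum" line per filename to the output file.
def writeSummary(File input, File output) {
    // filename -> list of match counts, kept in first-seen order
    def fileMatches = [:].withDefault { [] }
    input.eachLine { line ->
        // skip blank lines and the header line (assumed to start with "filename")
        if (line.trim() && !line.startsWith('filename')) {
            def (source, matches) = line.split(/\t/)
            fileMatches[source] << (matches as int)
        }
    }
    output.withWriter { writer ->
        fileMatches.each { source, matches ->
            writer.writeLine "$source\t${matches.size()}\t${matches.sum()}"
        }
    }
}

// Try it out with temporary files holding the sample data
def input = File.createTempFile('matches', '.txt')
input.text = 'filename\tmatches\nbugs.txt\t5\nbugs.txt\t3\nbugs.txt\t12\n' +
             'fish.txt\t4\nfish.txt\t67\nbirds.txt\t34\n'
def output = File.createTempFile('summary', '.txt')
writeSummary(input, output)
print output.text
```

Unlike a loop that prints on each filename change, the map version does not need rows for the same filename to be adjacent, at the cost of holding every match count in memory until the end.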

How to count blocks of matches in a text file in Groovy
