Question

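I am trying to "recreate" song lyrics from term-frequency counts. I have two source data files. The first is a list of the 5,000 most common terms in the lyrics corpus I am using, ordered from most common (1) to least common (5000). The second file is the lyrics corpus itself, which consists of more than 200,000 songs.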

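Each "song" is a comma-separated string like this: "SONGID1,SONGID2,1:13,2:10,4:6,7:15,...", where the first two entries are ID tags for the song, followed by pairs of a term ID (the number to the left of the colon) and the number of times that term is used in the song (the number to the right of the colon). In the example above, this means that "I" (entry "1", the first of the 5,000 most common terms) appears 13 times in this song, "the" (the second most common term) appears 10 times, and so on.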
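
To make the format concrete, here is a minimal Java sketch of how one such line could be split into its two IDs and its termID:termCount pairs. The class name `SongLine` and its fields are made up for illustration, and it assumes every entry after the two IDs is a well-formed "number:number" pair.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: holds one parsed line of the corpus file.
class SongLine {
    final String id1;
    final String id2;
    // Term ID (1-based rank in the word list) -> occurrences in this song,
    // kept in the order the pairs appear on the line.
    final Map<Integer, Integer> counts;

    SongLine(String id1, String id2, Map<Integer, Integer> counts) {
        this.id1 = id1;
        this.id2 = id2;
        this.counts = counts;
    }

    static SongLine parse(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected two song IDs: " + line);
        }
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int i = 2; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("bad termID:termCount pair: " + fields[i]);
            }
            counts.put(Integer.parseInt(pair[0].trim()), Integer.parseInt(pair[1].trim()));
        }
        return new SongLine(fields[0].trim(), fields[1].trim(), counts);
    }

    public static void main(String[] args) {
        SongLine song = SongLine.parse("SONGID1,SONGID2,1:13,2:10,4:6,7:15");
        // prints: SONGID1 SONGID2 {1=13, 2=10, 4=6, 7=15}
        System.out.println(song.id1 + " " + song.id2 + " " + song.counts);
    }
}
```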

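What I want to do is go from this "termID:termCount" format to actually "recreating" the original (albeit scrambled) lyrics: replace the number to the left of the colon with the actual term, then repeat that term as many times as the count to the right of the colon says. Again using the short example above, my preferred output would be: "SONGID1, SONGID2, I I I I I I I I I I I I I the the the the the the the the the the and and and and and and ..." and so on. Thanks!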

Answer 1

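Perhaps the following (untested) will inspire you. You didn't say what you want to do with the output, so you may want to change print() to a file write or something else.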

// Assumes each word is on its own line, sorted from most to least common.
String[] words = loadStrings("words.txt");

// Two ways to read the songs:
// loadStrings() again, which uses a lot of memory for big files, or
// a BufferedReader, which is more complicated but works well for large files.
BufferedReader reader = createReader("songs.txt");
try {
  String line = reader.readLine();
  while (line != null) {
    String[] data = line.split(",");
    print(data[0] + ", " + data[1] + ", "); // the two song IDs
    for (int i = 2; i < data.length; i++) {
      String[] pair = data[i].split(":");
      // Term IDs count from 1 but the array indexes from 0, so subtract 1.
      String word = words[int(trim(pair[0])) - 1];
      int count = int(trim(pair[1]));
      // Inelegant, but clear: print the word once per occurrence.
      for (int j = 0; j < count; j++) {
        print(word + " ");
      }
    }
    println();
    line = reader.readLine();
  }
  reader.close();
} catch (IOException e) {
  // readLine() and close() can throw, so report the problem and stop.
  e.printStackTrace();
}
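
If you would rather write the result to a file and don't need Processing, the same idea can be sketched in plain Java. This is only a rough sketch: the class name `LyricsExpander` and the output file name lyrics.txt are made up, it assumes term IDs start at 1, and it assumes every line has the two IDs followed by well-formed pairs.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper class, not part of any library.
class LyricsExpander {
    // Turns "ID1,ID2,termID:count,..." into "ID1, ID2, word word ...".
    static String expand(String line, List<String> words) {
        String[] fields = line.split(",");
        StringBuilder out = new StringBuilder();
        out.append(fields[0].trim()).append(", ").append(fields[1].trim()).append(",");
        for (int i = 2; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":");
            // Term IDs are assumed to be 1-based; list indexes are 0-based.
            String word = words.get(Integer.parseInt(pair[0].trim()) - 1);
            int count = Integer.parseInt(pair[1].trim());
            for (int j = 0; j < count; j++) {
                out.append(' ').append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // The word list is small (5,000 lines), so reading it whole is fine.
        List<String> words = Files.readAllLines(Paths.get("words.txt"), StandardCharsets.UTF_8);
        // Stream the songs line by line so the large corpus never sits in memory.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("songs.txt"), StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(Paths.get("lyrics.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                writer.write(expand(line, words));
                writer.newLine();
            }
        }
    }
}
```

For example, with the word list ["I", "the"], expand("A,B,1:2,2:1", words) gives "A, B, I I the".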

Recreating song lyrics (words) from term-frequency counts (numbers)
