I found this solution on SO for detecting n-grams in a string (here: N-gram generation from a sentence):
import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}
This code has by far the longest processing time in my pipeline: 28 seconds to detect the 1-grams, 2-grams, 3-grams and 4-grams of my corpus (4 MB of raw text), while the other operations (keyword removal, etc.) take only milliseconds.
Does anyone know of a Java solution that would be faster than the loop-based solution above? (I was thinking of multithreading, using collections, or perhaps a more creative way of splitting the string...?) Thanks!
Answer 0 (score: 3)
You could try something like this:
import java.util.ArrayList;
import java.util.List;

public class NGram {
    private final int n;
    private final String text;

    // indexes[0] is the start offset of the current n-gram's first word;
    // the remaining entries track the start offsets of the following words.
    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    // Advances to the next n-gram; returns false once the text has been consumed.
    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while (++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found < n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    // Shifts the tracked word-start offsets left and records the start of the next word.
    private void push() {
        for (int i = 0; i < n - 1; i++) {
            indexes[i] = indexes[i + 1];
        }
        indexes[n - 1] = index + 1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    // The current n-gram is a single substring from the first tracked word start
    // up to the current position; no per-word concatenation is needed.
    private String get() {
        return text.substring(indexes[0], index);
    }
}
Testing on about 5 MB of text, this appears to be roughly 10x faster than the original code. The main differences are that no regular expression is used to split the text, and the n-gram strings are not created by concatenation.
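For reference, a minimal usage sketch (assuming list() is made public or exposed through a small wrapper, which the snippet above does not do) could look like this:

// Hypothetical usage sketch: assumes NGram.list() has been made public.
import java.util.List;

public class NGramDemo {
    public static void main(String[] args) {
        String text = "This is my car.";
        for (int n = 1; n <= 3; n++) {
            List<String> ngrams = new NGram(text, n).list();
            System.out.println(n + "-grams: " + ngrams);
        }
        // Expected output (same n-grams as the split-based code in the question):
        // 1-grams: [This, is, my, car.]
        // 2-grams: [This is, is my, my car.]
        // 3-grams: [This is my, is my car.]
    }
}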
Update: Here is the output I get when running it on the text mentioned above for n-grams 1 to 4. I ran with 2 GB of memory to limit the impact of GC during the runs, and I ran the loop multiple times to see the effect of the HotSpot compiler.
Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118
Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118
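The exact harness that produced these numbers is not shown; a minimal sketch of one possible timing loop (assuming NGram.list() is public, the question's Test class is on the classpath, and the corpus file name is illustrative) would be:

// Hypothetical timing harness: "mine" = the NGram class above,
// "orig" = the split-based ngrams() from the question's Test class.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NGramBenchmark {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("corpus.txt")),
                StandardCharsets.UTF_8);
        for (int loop = 1; loop <= 10; loop++) {
            for (int n = 1; n <= 4; n++) {
                long start = System.currentTimeMillis();
                int count = new NGram(text, n).list().size();  // assumes list() is public
                System.out.printf("Loop %02d Code mine ngram %d time %03dms ngrams %d%n",
                        loop, n, System.currentTimeMillis() - start, count);

                start = System.currentTimeMillis();
                count = Test.ngrams(n, text).size();
                System.out.printf("Loop %02d Code orig ngram %d time %03dms ngrams %d%n",
                        loop, n, System.currentTimeMillis() - start, count);
            }
        }
    }
}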
Answer 1 (score: 0)
Running the code you provided over roughly 5 MB of Lorem Ipsum text typically took a little over 7 seconds to detect the 1- to 4-grams. I rewrote the code so that it builds the list of the longest n-grams first and then walks over that list, generating lists of successively shorter n-grams. On the same text this took about 2.6 seconds in my tests, and it also used less memory.
import java.util.*;

public class Test {

    public static List<String> ngrams(int max, String val) {
        List<String> out = new ArrayList<String>(1000);
        String[] words = val.split(" ");
        for (int i = 0; i < words.length - max + 1; i++) {
            out.add(makeString(words, i, max));
        }
        return out;
    }

    public static String makeString(String[] words, int start, int length) {
        StringBuilder tmp = new StringBuilder(100);
        for (int i = start; i < start + length; i++) {
            tmp.append(words[i]).append(" ");
        }
        return tmp.substring(0, tmp.length() - 1);
    }

    public static List<String> reduceNgrams(List<String> in, int size) {
        if (1 < size) {
            List<String> working = reduceByOne(in);
            in.addAll(working);
            for (int i = size - 2; i > 0; i--) {
                working = reduceByOne(working);
                in.addAll(working);
            }
        }
        return in;
    }

    public static List<String> reduceByOne(List<String> in) {
        List<String> out = new ArrayList<String>(in.size());
        int end;
        for (String s : in) {
            end = s.lastIndexOf(" ");
            out.add(s.substring(0, -1 == end ? s.length() : end));
        }
        // The loop above only ever drops the trailing word (keeping words 0..n-1);
        // for the last entry we also add the variant that drops the leading word
        // instead, so the final words 1..n are not lost.
        String s = in.get(in.size() - 1);
        out.add(s.substring(s.indexOf(" ") + 1));
        return out;
    }

    public static void main(String[] args) {
        long start;
        start = System.currentTimeMillis();

        List<String> ngrams = ngrams(3, "Your text goes here, actual mileage may vary");
        reduceNgrams(ngrams, 3);

        System.out.println(System.currentTimeMillis() - start);
    }
}
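To make the reduction step concrete, here is a small worked example (illustrative only; it uses the Test class from this answer and the question's sample sentence):

// Illustrative worked example of ngrams() followed by reduceNgrams().
import java.util.List;

public class ReduceDemo {
    public static void main(String[] args) {
        List<String> grams = Test.ngrams(3, "This is my car.");
        System.out.println(grams);
        // prints: [This is my, is my car.]

        Test.reduceNgrams(grams, 3);
        System.out.println(grams);
        // prints: [This is my, is my car., This is, is my, my car., This, is, my, car.]
    }
}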