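I'm currently working on an application that computes various statistics. One task is to analyze a large number of sentences for their word counts.
The specification is as follows:
This is my current approach: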
    // Temporary table holding one row per word occurrence.
    db.execSQL("create temp table if not exists WORDS (WORD varchar, SENT integer)");
    Cursor c = db.rawQuery("select lower(MSG) as SENTENCE, SENT from MESSAGELIST", null);
    while (c.moveToNext()) {
        String[] words = c.getString(c.getColumnIndex("SENTENCE")).split("\\s+");
        int from_me = c.getInt(c.getColumnIndex("SENT"));
        for (int i = 0; i < words.length; i++) {
            // Strip everything that is not an ASCII letter.
            words[i] = words[i].replaceAll("[^a-zA-Z]", "");
            if (!words[i].equals("")) {
                db.execSQL("insert into WORDS values ('" + words[i] + "', "
                        + from_me + ")");
            }
        }
    }
    // Top 10 words in received (SENT=0) and sent (SENT=1) messages.
    Cursor c2 = db.rawQuery(
            "select WORD, COUNT(*) as CNT from WORDS where SENT=0 group by WORD order by CNT desc limit 10",
            null);
    Cursor c3 = db.rawQuery(
            "select WORD, COUNT(*) as CNT from WORDS where SENT=1 group by WORD order by CNT desc limit 10",
            null);
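As I expected, this code is slow. I guess the string operations take a lot of time.
Pulling the data out of a query and re-inserting it into the database also feels wrong. I know that PostgreSQL has regexp_split_to_array and regexp_split_to_table, which would let the data stay in the database for the query, but I haven't found an equivalent in SQLite.
I've spent a lot of time trying to come up with different solutions, but now I'm somewhat stuck. Is there a (relatively) fast way to perform this task? I would also like the word count to be as reasonable as possible.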
Current version, incorporating some of the suggestions:
Improvements:
Counting with HashMultiset: 2% faster
    c = db.rawQuery("select lower(DATA) as SENTENCE, SENT from MESSAGELIST", null);
    CharMatcher pat_rep = CharMatcher.inRange('A', 'Z')
            .or(CharMatcher.inRange('a', 'z')).precomputed();
    Pattern pat_split = Pattern.compile("\\s");
    HashMultiset<String> sent = HashMultiset.create();
    HashMultiset<String> rcvd = HashMultiset.create();
    // Count all words in memory first, separately for sent and received messages.
    while (c.moveToNext()) {
        String[] words = pat_split.split(c.getString(c.getColumnIndex("SENTENCE")));
        int from_me = c.getInt(c.getColumnIndex("SENT"));
        for (int i = 0; i < words.length; i++) {
            words[i] = pat_rep.retainFrom(words[i]);
            if (!words[i].equals("")) {
                if (from_me == 1) {
                    sent.add(words[i]);
                } else {
                    rcvd.add(words[i]);
                }
            }
        }
    }
    db.execSQL("create temp table if not exists WORDS (WORD varchar, SENT integer, CNT integer)");
    SQLiteStatement ins = db.compileStatement("insert into WORDS values (?, ?, ?)");
    db.beginTransaction();
    try {
        // elementSet() yields each distinct word once; iterating the multiset
        // itself would repeat a word once per occurrence.
        Iterator<String> i = sent.elementSet().iterator();
        while (i.hasNext()) {
            String in = i.next();
            ins.bindString(1, in);
            ins.bindLong(2, 1);
            ins.bindLong(3, sent.count(in));
            ins.executeInsert();
            ins.clearBindings();
        }
        i = rcvd.elementSet().iterator();
        while (i.hasNext()) {
            String in = i.next();
            ins.bindString(1, in);
            ins.bindLong(2, 0);
            ins.bindLong(3, rcvd.count(in));
            ins.executeInsert();
            ins.clearBindings();
        }
        db.setTransactionSuccessful();
    } finally {
        // Always close the transaction, even if an insert fails.
        db.endTransaction();
    }
    c = db.rawQuery(
            "select WORD, CNT from WORDS where SENT=0 group by WORD order by CNT desc limit 10",
            null);
    Cursor c2 = db.rawQuery(
            "select WORD, CNT from WORDS where SENT=1 group by WORD order by CNT desc limit 10",
            null);
Answer 0 (score: 1)
    db.execSQL("insert into WORDS values ('" + words[i] + "', "
            + from_me + ")");
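Too much database access. Hitting the database for every single word is never going to be fast. Since many words are repeated, you can count them in a Multiset and write them to the database when the counts grow large, when memory gets tight, or when you're done.
Creating a separate row for each occurrence makes no sense either. Add a count column (preferably under a different name, since "count" is a keyword).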
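The two points above can be sketched in plain Java. This is only an illustration under assumptions, not the asker's code: a java.util.HashMap stands in for Guava's Multiset so that the snippet has no dependencies, and the names WordTally, addSentence, count and distinctWords are made up for this example. Each distinct word ends up as a single (word, count) entry, which is what would then be written as one row.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: tallies words in memory so that each distinct word
// needs only one (word, count) row instead of one row per occurrence.
class WordTally {
    // Compiled once and reused for every word.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-z]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private final Map<String, Integer> counts = new HashMap<>();

    // Lowercases the sentence, splits it on whitespace, strips everything
    // that is not an ASCII letter and counts the remaining non-empty words.
    void addSentence(String sentence) {
        for (String raw : WHITESPACE.split(sentence.toLowerCase(Locale.ROOT))) {
            String word = NON_LETTERS.matcher(raw).replaceAll("");
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Number of occurrences seen so far, or 0 for an unknown word.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Number of distinct words, i.e. the number of rows to insert.
    int distinctWords() {
        return counts.size();
    }
}
```

With one tally for sent and one for received messages, the database would only see one insert per distinct word, all inside a single transaction.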
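Use prepared statements. By building a new query string every time, you force the DB to parse it again and again, and you also create extra work for the GC.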
    words[i] = words[i].replaceAll("[^a-zA-Z]", "");
Use Pattern.compile or CharMatcher. The latter produces no garbage in the common case where the word contains no special characters.
    private final CharMatcher alpha = CharMatcher.inRange('A', 'Z')
            .or(CharMatcher.inRange('a', 'z')).precomputed();

    alpha.retainFrom(words[i]);
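If Guava is not available, the same idea can be approximated with the JDK alone. The following is a rough sketch under assumptions: LetterFilter and retainLetters are hypothetical names, and the code only mimics the behaviour described here (returning the input unchanged when there is nothing to strip). It is not how CharMatcher is actually implemented.

```java
// Hypothetical JDK-only stand-in for the CharMatcher shown above.
class LetterFilter {
    private static boolean isAsciiLetter(char ch) {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    }

    // Returns the ASCII letters of s in their original order. If s already
    // consists only of letters, s itself is returned and nothing is allocated.
    static String retainLetters(String s) {
        int firstBad = 0;
        while (firstBad < s.length() && isAsciiLetter(s.charAt(firstBad))) {
            firstBad++;
        }
        if (firstBad == s.length()) {
            return s; // common case: nothing to strip
        }
        // Copy the clean prefix, then filter the rest character by character.
        StringBuilder sb = new StringBuilder(s.length());
        sb.append(s, 0, firstBad);
        for (int i = firstBad + 1; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (isAsciiLetter(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

In the loop from the question this would be used as `words[i] = LetterFilter.retainLetters(words[i]);`, replacing the replaceAll call.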
This should help a lot, especially the DB part. Try it out, and come back if it's still not enough.