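I have 14 prepared word-based 3-gram files; the txt files total 75 GB. The n-grams are separated by ";", and within each n-gram the word that follows the word sequence is separated from it by "|". I now want to count how often each word follows a given 3-word sequence. Given the amount of data, I need this to run as fast as possible.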
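My approach is:
- Split the n-grams on ;
- Split the sequence from the following word on |
- Store the sequences and words in the tables sequences and words, and count how often each word occurs in the words table

I have SQL Server 2014 Express, and my tables have the following structure: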
[dbo].[sequences]: Id | Sequence
[dbo].[words]: Id | sid | word | count
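The sequences table should be self-explanatory. In the words table, sid is the id of the related sequence, word is the word string, and count is an int that records how often the word occurs after that sequence.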
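To make the intended result concrete, here is a minimal in-memory sketch of the split-and-count steps, leaving the database aside. The names NgramSketch, ParsePairs and Count are hypothetical and are not part of my program, and the Trim calls are an assumption about stray whitespace around the separators. It only illustrates the counting logic; I am not claiming that 75 GB of data would fit in memory this way.

```csharp
using System;
using System.Collections.Generic;

static class NgramSketch
{
    // Splits one input line into (sequence, word) pairs.
    // Assumes the "sequence|word;" layout described above.
    public static IEnumerable<KeyValuePair<string, string>> ParsePairs(string line)
    {
        string[] ngrams = line.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string ngram in ngrams)
        {
            string[] gram = ngram.Split('|');
            if (gram.Length > 1)
            {
                // Trim is an assumption: it drops whitespace around the separator.
                yield return new KeyValuePair<string, string>(gram[0].Trim(), gram[1].Trim());
            }
        }
    }

    // Tallies how often each word follows each sequence:
    // outer key = sequence, inner key = following word, value = count.
    public static Dictionary<string, Dictionary<string, int>> Count(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, Dictionary<string, int>>();
        foreach (string line in lines)
        {
            foreach (KeyValuePair<string, string> pair in ParsePairs(line))
            {
                Dictionary<string, int> words;
                if (!counts.TryGetValue(pair.Key, out words))
                {
                    words = new Dictionary<string, int>();
                    counts[pair.Key] = words;
                }
                int n;
                words.TryGetValue(pair.Value, out n); // n stays 0 if the word is new
                words[pair.Value] = n + 1;
            }
        }
        return counts;
    }
}
```

For the made-up lines "a b c|d;a b c|d;" and "a b c|e;x y z|w;", Count reports that d follows "a b c" twice, e follows it once, and w follows "x y z" once. That is the same information the words table is meant to hold per sid.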
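My solution below needs about 1 second per line at the start, which is very slow. I tried to use Parallel, but then I got an SQL error, I guess because the table is locked while another process is inserting something.
My program: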
static void Main(string[] args)
{
    DateTime begin = DateTime.Now;
    SqlConnection myConnection = new SqlConnection(@"Data Source=(localdb)\Projects;Database=ngrams;Integrated Security=True;Connect Timeout=30;Encrypt=False;TrustServerCertificate=False");
    myConnection.Open();
    // Process the 14 prepared files one after another.
    for (int i = 0; i < 14; i++)
    {
        using (FileStream fs = File.Open(@"F:\Documents\ngrams\prepared_" + i + ".txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            string line;
            int a = 0; // number of lines read from the current file
            while ((line = sr.ReadLine()) != null)
            {
                // Each line holds several n-grams separated by ';'.
                string[] ngrams = line.Split(new char[] { ';' });
                foreach (string ngram in ngrams)
                {
                    // Each n-gram has the form "sequence|word".
                    string[] gram = ngram.Split(new Char[] { '|' });
                    if (gram.Length > 1)
                    {
                        string sequence = gram[0];
                        string word = gram[1];
                        storeNgrams(myConnection, sequence, word);
                    }
                }
                // Elapsed minutes since the start, printed after every line.
                Console.WriteLine(DateTime.Now.Subtract(begin).TotalMinutes);
                a++;
            }
        }
    }
    Console.WriteLine("Processed 75 Gigabyte in hours: " + DateTime.Now.Subtract(begin).TotalHours);
}
// Looks up the sequence (inserting it if it is missing), then inserts or updates the word row.
private static void storeNgrams(SqlConnection myConnection, string sequence, string word)
{
    SqlCommand insSeq = new SqlCommand("INSERT INTO sequences (sequence) VALUES (@sequence); SELECT SCOPE_IDENTITY()", myConnection);
    SqlCommand insWord = new SqlCommand("INSERT INTO words (sid, word, count) VALUES (@sid, @word, @count)", myConnection);
    SqlCommand updateWordCount = new SqlCommand("UPDATE words SET count = @count WHERE sid = @sid AND word = @word", myConnection);
    SqlCommand searchSeq = new SqlCommand("SELECT Id from sequences WHERE sequence = @sequence", myConnection);
    SqlCommand getWordCount = new SqlCommand("Select count from words WHERE sid = @sid AND word = @word", myConnection);
    searchSeq.Parameters.AddWithValue("@sequence", sequence);
    object searchSeq_obj = searchSeq.ExecuteScalar();
    if (searchSeq_obj != null)
    {
        // The sequence already exists: reuse its id.
        insNgram(insWord, updateWordCount, getWordCount, searchSeq_obj, word).ExecuteNonQuery();
    }
    else
    {
        // New sequence: insert it and use the generated id.
        insSeq.Parameters.AddWithValue("@sequence", sequence);
        object sid_obj = insSeq.ExecuteScalar();
        if (sid_obj != null)
        {
            insNgram(insWord, updateWordCount, getWordCount, sid_obj, word).ExecuteNonQuery();
        }
    }
}

// Returns the command to run: an UPDATE with count + 1 if the word row exists, otherwise an INSERT with count 1.
private static SqlCommand insNgram(SqlCommand insWord, SqlCommand updateWordCount, SqlCommand getWordCount, object sid_obj, string word)
{
    int sid = Convert.ToInt32(sid_obj);
    getWordCount.Parameters.AddWithValue("@sid", sid);
    getWordCount.Parameters.AddWithValue("@word", word);
    object wordCount_obj = getWordCount.ExecuteScalar();
    if (wordCount_obj != null)
    {
        int wordCount = Convert.ToInt32(wordCount_obj) + 1;
        return storeWord(updateWordCount, sid, word, wordCount);
    }
    else
    {
        int wordCount = 1;
        return storeWord(insWord, sid, word, wordCount);
    }
}

// Binds the parameters to the given INSERT or UPDATE command and returns it.
private static SqlCommand storeWord(SqlCommand updateWord, int sid, string word, int wordCount)
{
    updateWord.Parameters.AddWithValue("@sid", sid);
    updateWord.Parameters.AddWithValue("@word", word);
    updateWord.Parameters.AddWithValue("@count", wordCount);
    return updateWord;
}
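How can I process the n-grams faster so that it does not take an excessive amount of time?
P.S.: I am completely new to C# and natural language processing.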
Edit 1: As requested, a sample n-gram; there are about 4 or 5 per line (with different word combinations, of course): roughly the same|like;
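Edit 2: When I change the code to the following, I get the error System.AggregateException: One or more errors occurred. ---> System.InvalidOperationException: There is already an open DataReader associated with this Command which must be closed first. This is the same error that others have reported.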
Parallel.For(0, 14, i => sqlaction(myConnection, i, begin));
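Edit 3: After adding MultipleActiveResultSets=true to the connection string, I no longer get any errors with Parallel. I replaced all the relevant loops with their Parallel equivalents. I also iterated over all the files just to count the lines (169,521,628 lines) and measured the average time needed per line, which is 0.051502946 seconds. Even so, I would still need about 101 days!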
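For reference, the 101-day figure is simply the line count multiplied by the average time per line, converted to days. A small sketch of that arithmetic follows; the names Estimate and TotalDays are hypothetical and used only for illustration.

```csharp
using System;

static class Estimate
{
    // Estimated total runtime in days for a given number of lines
    // and a measured average processing time per line in seconds.
    public static double TotalDays(long lines, double secondsPerLine)
    {
        const double secondsPerDay = 24 * 60 * 60; // 86,400
        return lines * secondsPerLine / secondsPerDay;
    }
}
```

Calling Estimate.TotalDays(169521628, 0.051502946) gives roughly 8.73 million seconds in total, which is a little over 101 days.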