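I have 655,000 "words" in a file. I want to cross-reference a user-supplied "word" against the file to see whether there is a match.
At the moment I just open the file, read it line by line, and check whether the values are the same.
But going through the file this way takes a very long time.
Is there a faster way to do the comparison? Should I read the whole file, then split it and compare?
I tried to "index" the word file, but that also takes forever. The indexing code is at the end of this post.
I run it in a separate thread.
The file grows very quickly: two hours ago it held 10,000 "words", and I expect it to reach the tens of millions.
I use the term "words" because the file contains data from my first neural network AI, so unfortunately searching against a reference word list does not work.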
Do While sr.Peek() >= 0
    NewWord = sr.ReadLine()
    progressvalue = progressvalue + 1
    ' Skip blank lines: NewWord(0) throws on an empty string.
    If NewWord.Length = 0 Then Continue Do
    FirstLetter = NewWord(0)
    Dim folderLetter As Char = Char.ToUpperInvariant(NewWord(0))
    ' There are only folders A to Z. The original silently reused the previous
    ' word's folder for any other first character; here such words are skipped.
    If folderLetter < "A"c OrElse folderLetter > "Z"c Then Continue Do
    ' Lengths up to 5 share 5.txt and lengths of 12 or more share 12.txt.
    ' The original tested "< 5", so 5-character words reused the previous file name.
    Wordlength = Math.Min(Math.Max(NewWord.Length, 5), 12)
    writefile = CStr(Wordlength) & ".txt"
    Writepath = "H:\Dictionary\" & folderLetter & "\"
    outputpath = Writepath & writefile
    ' Note: the output file is opened and closed once per word.
    Using sw As StreamWriter = File.AppendText(outputpath)
        sw.WriteLine(NewWord)
    End Using
Loop
Answer 0 (score: 1)
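A hash-based data structure (for example, HashSet in .NET) would be the fastest way to add and check words, but you will eventually run out of memory as you keep adding words.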
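As a rough illustration of the HashSet approach, the sketch below loads a file into a HashSet(Of String) once and then answers lookups from memory. The module name, the LoadWords helper, the demo words, and the choice of a case-insensitive comparer are all assumptions made for the example, not part of the question's code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module WordLookupSketch
    ' Hypothetical helper: reads every non-empty line into a HashSet.
    ' Case-insensitive matching is an assumption; use StringComparer.Ordinal
    ' if "Foo" and "foo" must count as different words.
    Function LoadWords(path As String) As HashSet(Of String)
        Dim words As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        ' File.ReadLines streams the file instead of loading it as one string.
        For Each line As String In File.ReadLines(path)
            If line.Length > 0 Then words.Add(line)
        Next
        Return words
    End Function

    Sub Main()
        ' Demo data in a temporary file; a real program would pass its own word file.
        Dim demoPath As String = Path.GetTempFileName()
        File.WriteAllLines(demoPath, {"alpha", "Bravo", "charlie"})

        Dim words As HashSet(Of String) = LoadWords(demoPath)

        ' Contains is a hash lookup, not a scan over all stored words.
        Console.WriteLine(words.Contains("bravo"))  ' prints True
        Console.WriteLine(words.Contains("delta"))  ' prints False

        ' Newly generated words can be added without reloading the file.
        words.Add("delta")
        Console.WriteLine(words.Contains("delta"))  ' prints True

        File.Delete(demoPath)
    End Sub
End Module
```

The file is read once at startup; after that each check costs roughly the same no matter how many words are stored, at the price of keeping every word in memory.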
A database is probably the best option, since the words will be indexed and you can access it from multiple machines.
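Using the file system is most likely the slowest way, but my guess is that using folder names instead of files would be faster. For example, for the word Foo the path would be "H:\Dictionary\F\O\O\" (upper or lower case does not matter on most of the popular file systems I know of). It will also use more space, because each folder carries its own metadata and settings.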
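A minimal sketch of the folder-per-character idea follows. WordToFolder, the "_end" marker folder, and the temporary demo root are assumptions introduced for illustration. The marker is needed because storing Foo also creates the folders for F and Fo, which would otherwise look like stored words. The sketch also assumes every character in a word is valid in a folder name.

```vbnet
Imports System
Imports System.IO

Module FolderIndexSketch
    ' Hypothetical helper: maps "Foo" to root\F\O\O\_end.
    ' The word is upper-cased so lookups behave the same on
    ' case-sensitive and case-insensitive file systems.
    Function WordToFolder(root As String, word As String) As String
        Dim result As String = root
        For Each ch As Char In word.ToUpperInvariant()
            result = Path.Combine(result, ch.ToString())
        Next
        ' "_end" marks a complete word; without it, every prefix would match.
        Return Path.Combine(result, "_end")
    End Function

    Sub Main()
        ' Demo root under the temp folder instead of H:\Dictionary.
        Dim root As String = Path.Combine(Path.GetTempPath(), "DictionaryDemo")

        ' Adding a word means creating its folder chain.
        Directory.CreateDirectory(WordToFolder(root, "Foo"))

        ' Checking a word is a single existence test on the path.
        Console.WriteLine(Directory.Exists(WordToFolder(root, "foo")))  ' prints True
        Console.WriteLine(Directory.Exists(WordToFolder(root, "Fo")))   ' prints False
        Console.WriteLine(Directory.Exists(WordToFolder(root, "bar")))  ' prints False

        Directory.Delete(root, True)
    End Sub
End Module
```

Each lookup touches only the folders along one path, but tens of millions of words would create a very large number of directories, which is the space cost mentioned above.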
If the project has some budget, you could look into a more capable solution such as Google BigQuery.