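Background:
I have two or more files that I need to search for matches. These files can easily run to more than 20,000 lines each. I need the fastest way to search them and find the matches between the files.
I have never done this kind of matching before. A line may have more than one match, and I need all of them returned.
What I know:
My current method involves heavy overuse of IEnumerable LINQ methods.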
Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
Dim fileText As IEnumerable(Of IEnumerable(Of CCDDetail)) = fileNames.Select(Function(fileName, fileIndex)
        Dim list As New List(Of String)({fileName})
        list.AddRange(File.ReadAllLines(fileName))
        Return list.Where(Function(fileLine, lineIndex) Not {list.Count - 1, list.Count - 2, 0, 1, 2}.Contains(lineIndex)).
            Select(Function(fileLine) New CCDDetail(list(0), fileLine.Substring(12, 17).Trim(), fileLine.Substring(29, 10).Trim(), fileLine.Substring(39, 8).Trim(), fileLine.Substring(48, 6).Trim(), fileLine.Substring(54, 22).Trim()))
    End Function)
Dim asdf = fileText.
    Select(Function(file, inx) file.
        Select(Function(fileLine, ix) fileText.
            Skip(inx + 1).
            Select(Function(fileToSearch) fileLine.MatchesAny(ix, fileToSearch)).
            Aggregate(New List(Of Integer)(), Function(old, cc)
                    Dim lcc As New List(Of Integer)(cc)
                    lcc.Insert(0, If(old.Count > 0, old(0) + 1, 1))
                    old.AddRange(lcc)
                    Return old
                End Function)))
Functions in CCDDetail:
Public Function Matches(ccd2 As CCDDetail) As Boolean
    Return CustomerName = ccd2.CustomerName OrElse
        DfiAccountNumber = ccd2.DfiAccountNumber OrElse
        CustomerRefId = ccd2.CustomerRefId OrElse
        PaymentAmount = ccd2.PaymentAmount OrElse
        PaymentId = ccd2.PaymentId
End Function
Public Function MatchesAny(index As Integer, ccd2 As IEnumerable(Of CCDDetail)) As IEnumerable(Of Integer)
    Return Enumerable.Range(0, ccd2.Count).Where(Function(i) ccd2(i).Matches(Me))
End Function
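This works on my test files, but with the full-length files it takes about 7 minutes.
Question:
Is there a faster way? Any performance tips?
Thanks.
Update:
I just cut the time down a lot by using a list of dictionaries and a regex. I will finish the application and then run some tests comparing the variations.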
Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
Dim textFiles As New List(Of Dictionary(Of Integer, CCDDetail))()
Dim fileInnerText As String()
Dim reg As Regex = New Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled)
Dim mat As Match
Dim fileSpecText As Dictionary(Of Integer, CCDDetail)
Dim lineMatches As New List(Of Integer())
For i As Integer = 0 To fileNames.Length - 1
    fileInnerText = File.ReadAllLines(fileNames(i))
    fileSpecText = New Dictionary(Of Integer, CCDDetail)()
    For j As Integer = 2 To fileInnerText.Length - 3
        mat = reg.Match(fileInnerText(j))
        fileSpecText.Add(j, New CCDDetail(mat.Groups(1).Value, mat.Groups(2).Value, mat.Groups(3).Value, mat.Groups(4).Value, mat.Groups(5).Value))
    Next
    textFiles.Add(fileSpecText)
Next
For i As Integer = 0 To textFiles.Count - 1
    'Dim source As Dictionary(Of Integer, CCDDetail) = textFiles(i)
    For j As Integer = 2 To textFiles(i).Count - 1 + 2
        For k As Integer = i + 1 To textFiles.Count - 1
            For l As Integer = 2 To textFiles(k).Count - 1 + 2
                If (textFiles(i)(j).Matches(textFiles(k)(l))) Then
                    lineMatches.Add({i, j, k, l})
                End If
            Next
        Next
    Next
Next
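Answer 0 (score: 1)
Please see my comment on your question. The following (untested) sample code shows how using a Dictionary<> might speed things up. It takes your "update" and builds from there, so that you can follow my C# example (sorry, I don't write VB.NET). The idea is that using your fields as dictionary keys makes it faster to find all matching lines (lines that have the same field value).
Your code (and mine) could be improved further by not loading all of the files into memory at once: when comparing two files, you only need to load one of them into a dictionary.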
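To make that last point concrete before the full example, here is a minimal sketch of indexing one file and streaming the other. It assumes the same fixed-width layout and the same two header and two trailer lines as the question's update. The names `StreamingCompare`, `BuildIndex` and `Probe` are made up for illustration, and value tuples need C# 7 or later.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StreamingCompare
{
    // Same fixed-width layout as the regex in the question's update.
    static readonly Regex Layout =
        new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);

    // Index one file: (field number, field value) -> line numbers that hold it.
    public static Dictionary<(int Field, string Value), List<int>> BuildIndex(string[] lines)
    {
        var index = new Dictionary<(int Field, string Value), List<int>>();
        for (int j = 2; j < lines.Length - 2; ++j) // skip 2 header and 2 trailer lines
        {
            var m = Layout.Match(lines[j]);
            if (!m.Success) continue; // line too short for the layout
            for (int k = 1; k <= 5; ++k)
            {
                var key = (k, m.Groups[k].Value);
                if (!index.TryGetValue(key, out var hits))
                    index[key] = hits = new List<int>();
                hits.Add(j);
            }
        }
        return index;
    }

    // Stream the other file line by line and look each field up in the index.
    public static IEnumerable<(int Field, int LineInIndexed, int LineInStreamed)> Probe(
        Dictionary<(int Field, string Value), List<int>> index, IEnumerable<string> streamed)
    {
        var pending = new Queue<string>(); // holds back 2 lines so the trailer is never probed
        int read = 0;
        foreach (var line in streamed)
        {
            pending.Enqueue(line);
            read++;
            if (pending.Count <= 2) continue;
            var candidate = pending.Dequeue();
            int lineNo = read - 3;    // zero-based number of the dequeued line
            if (lineNo < 2) continue; // header lines
            var m = Layout.Match(candidate);
            if (!m.Success) continue;
            for (int k = 1; k <= 5; ++k)
                if (index.TryGetValue((k, m.Groups[k].Value), out var hits))
                    foreach (var hit in hits)
                        yield return (k, hit, lineNo);
        }
    }
}
```

A caller could build the index with `StreamingCompare.BuildIndex(File.ReadAllLines(fileA))` and pass `File.ReadLines(fileB)` to `Probe`, so only one file's fields are held in memory at a time. Each result also carries the field number that matched.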
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public void CompareLines(string[] fileNames)
{
    var textFileDictionaries = new List<Dictionary<CCDDetail, List<int>>>();
    var reg = new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);
    var lineMatches = new List<LineMatch>();
    foreach (var f in fileNames)
    {
        var fileInnerText = File.ReadAllLines(f);
        var fileSpecText = new Dictionary<CCDDetail, List<int>>();
        // skip the first 2 and last 2 lines, the same bounds as the loop in the question's update
        for (int j = 2; j < fileInnerText.Length - 2; ++j)
        {
            var mat = reg.Match(fileInnerText[j]);
            if (!mat.Success) continue; // line too short to hold all five fields
            for (int k = 1; k <= 5; ++k)
            {
                var key = new CCDDetail() { FieldId = k, Value = mat.Groups[k].Value };
                // the same field and value may occur on multiple lines
                if (fileSpecText.ContainsKey(key) == false)
                    fileSpecText.Add(key, new List<int>());
                fileSpecText[key].Add(j);
            }
        }
        textFileDictionaries.Add(fileSpecText);
    }
    // compare every pair of files (i, j) with i < j, including the last file
    for (int i = 0; i < textFileDictionaries.Count - 1; ++i)
    {
        for (int j = i + 1; j < textFileDictionaries.Count; ++j)
        {
            foreach (var tup in textFileDictionaries[j])
            {
                if (textFileDictionaries[i].ContainsKey(tup.Key))
                {
                    // the field value might occur on multiple lines
                    lineMatches.Add(new LineMatch() {
                        File1Index = i,
                        File1Lines = textFileDictionaries[i][tup.Key],
                        File2Index = j,
                        File2Lines = textFileDictionaries[j][tup.Key]
                    });
                }
            }
            /*
            for (int k = 0; k < textFileDictionaries[j].Count; ++k)
            {
                var key = textFileDictionaries[j].Keys.ToArray()[k];
                if (textFileDictionaries[i].ContainsKey(key))
                {
                    // the field value might occur on multiple lines
                    lineMatches.Add(new LineMatch()
                    {
                        File1Index = i,
                        File1Lines = textFileDictionaries[i][key],
                        File2Index = j,
                        File2Lines = textFileDictionaries[j][key]
                    });
                }
            }
            */
        }
    }
}
....
public class CCDDetail
{
    public int FieldId { get; set; }
    public string Value { get; set; }
    public override bool Equals(object obj)
    {
        // return false for null or another type instead of throwing
        var other = obj as CCDDetail;
        return other != null && FieldId == other.FieldId && Value == other.Value;
    }
    public override int GetHashCode()
    {
        return FieldId.GetHashCode() + Value.GetHashCode();
    }
}
public class LineMatch
{
    public int File1Index { get; set; }
    public List<int> File1Lines { get; set; }
    public int File2Index { get; set; }
    public List<int> File2Lines { get; set; }
}
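Keep in mind that I am assuming the same field value can occur on multiple lines in either of the files being compared. Also, the LineMatch list needs post-processing, because it holds a record of every group of lines in the two files that share a common field (you may also want to record which field number matched).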
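One possible shape for that post-processing is sketched below. It assumes a variant of LineMatch that also stores the field number; `FieldLineMatch`, `MatchPostProcessing` and `PairUp` are hypothetical names, not part of the code above. Each record is expanded into individual line pairs, and the field numbers each pair shares are collected.

```csharp
using System.Collections.Generic;

// Hypothetical variant of LineMatch that also remembers which field matched.
public class FieldLineMatch
{
    public int FieldId { get; set; }
    public int File1Index { get; set; }
    public List<int> File1Lines { get; set; }
    public int File2Index { get; set; }
    public List<int> File2Lines { get; set; }
}

public static class MatchPostProcessing
{
    // One entry per pair of lines, listing every field number the two lines share.
    public static Dictionary<(int File1, int Line1, int File2, int Line2), List<int>> PairUp(
        IEnumerable<FieldLineMatch> matches)
    {
        var pairs = new Dictionary<(int File1, int Line1, int File2, int Line2), List<int>>();
        foreach (var m in matches)
            foreach (var l1 in m.File1Lines)
                foreach (var l2 in m.File2Lines)
                {
                    var key = (m.File1Index, l1, m.File2Index, l2);
                    if (!pairs.TryGetValue(key, out var fields))
                        pairs[key] = fields = new List<int>();
                    fields.Add(m.FieldId);
                }
        return pairs;
    }
}
```

Expanding to line pairs is a cross product, so a value that is very common in both files (a frequent payment amount, for example) can produce many entries. Whether that is acceptable depends on the data.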