Question

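Background: I have two or more files that I need to search for matches. Each file can easily have more than 20,000 lines. I need the fastest way to search them and find the matches between files.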

I have never had to do matching like this before. A line may have more than one match, and I need all of them returned.

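What I know:

A file cannot match against itself.
Files are matched on a set of fields. If any field matches, the lines match.
This will run frequently, so it needs to be as fast as possible.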

My current approach involves heavy overuse of the IEnumerable LINQ methods.

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim fileText As IEnumerable(Of IEnumerable(Of CCDDetail)) = fileNames.Select(Function(fileName, fileIndex)
                                                                                     Dim list As New List(Of String)({fileName})
                                                                                     list.AddRange(File.ReadAllLines(fileName))
                                                                                     Return list.Where(Function(fileLine, lineIndex) Not {list.Count - 1, list.Count - 2, 0, 1, 2}.Contains(lineIndex)).
                                                                                         Select(Function(fileLine) New CCDDetail(list(0), fileLine.Substring(12, 17).Trim(), fileLine.Substring(29, 10).Trim(), fileLine.Substring(39, 8).Trim(), fileLine.Substring(48, 6).Trim(), fileLine.Substring(54, 22).Trim()))
                                                                                 End Function)
    Dim asdf = fileText.
        Select(Function(file, inx) file.
                        Select(Function(fileLine, ix) fileText.
                                   Skip(inx + 1).
                                   Select(Function(fileToSearch) fileLine.MatchesAny(ix, fileToSearch)).
                                   Aggregate(New List(Of Integer)(), Function(old, cc)
                                                                         Dim lcc As New List(Of Integer)(cc)
                                                                         lcc.Insert(0, If(old.Count > 0, old(0) + 1, 1))
                                                                         old.AddRange(lcc)
                                                                         Return old
                                                                     End Function)))

The functions in CCDDetail:

    Public Function Matches(ccd2 As CCDDetail) As Boolean
        Return CustomerName = ccd2.CustomerName OrElse
                DfiAccountNumber = ccd2.DfiAccountNumber OrElse
                CustomerRefId = ccd2.CustomerRefId OrElse
                PaymentAmount = ccd2.PaymentAmount OrElse
                PaymentId = ccd2.PaymentId
    End Function

    Public Function MatchesAny(index As Integer, ccd2 As IEnumerable(Of CCDDetail)) As IEnumerable(Of Integer)
        Return Enumerable.Range(0, ccd2.Count).Where(Function(i) ccd2(i).Matches(Me))
    End Function

This works on my test files, but it takes about 7 minutes with full-length files.

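Questions:

Is LINQ making things too slow? Should I write my own loops?
Should I use regular expressions instead of Substring?

Is there a faster way? Any performance tips?

Thanks.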

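Update:

I just cut the run time down a lot by using a list of dictionaries and a regular expression. I will finish the application and then benchmark a few variations.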

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim textFiles As New List(Of Dictionary(Of Integer, CCDDetail))()
    Dim fileInnerText As String()
    Dim reg As Regex = New Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled)
    Dim mat As Match
    Dim fileSpecText As Dictionary(Of Integer, CCDDetail)
    Dim lineMatches As New List(Of Integer())
    For i As Integer = 0 To fileNames.Length - 1
        fileInnerText = File.ReadAllLines(fileNames(i))
        fileSpecText = New Dictionary(Of Integer, CCDDetail)()
        For j As Integer = 2 To fileInnerText.Length - 3
            mat = reg.Match(fileInnerText(j))
            fileSpecText.Add(j, New CCDDetail(mat.Groups(1).Value, mat.Groups(2).Value, mat.Groups(3).Value, mat.Groups(4).Value, mat.Groups(5).Value))
        Next
        textFiles.Add(fileSpecText)
    Next
    For i As Integer = 0 To textFiles.Count - 1
        'Dim source As Dictionary(Of Integer, CCDDetail) = textFiles(i)
        For j As Integer = 2 To textFiles(i).Count - 1 + 2
            For k As Integer = i + 1 To textFiles.Count - 1
                For l As Integer = 2 To textFiles(k).Count - 1 + 2
                    If (textFiles(i)(j).Matches(textFiles(k)(l))) Then
                        lineMatches.Add({i, j, k, l})
                    End If
                Next
            Next
        Next
    Next

Answer 1

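Please see my comment on your question. The following (untested) sample code shows how using a Dictionary<> might speed things up. It takes your "Update" and builds from there, so you should be able to follow my C# example (sorry, I don't write VB.NET). The idea is that finding all matching lines (lines with the same field value) is faster when the field values themselves are used as dictionary keys.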

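Your code (and mine) could be improved further by not loading all files into memory at once. When comparing two files, only one of them needs to be loaded into a dictionary.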

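    // Requires: using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
    public void CompareLines(string[] fileNames)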
    {
        var textFileDictionaries = new List<Dictionary<CCDDetail,List<int>>>();
        var reg  = new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);
        var lineMatches = new List<LineMatch>();

        foreach(var f in fileNames)
        {
            var fileInnerText = File.ReadAllLines(f);
            var fileSpecText = new Dictionary<CCDDetail,List<int>>();
            // skip the first two and last two lines (header/trailer), as in the question's update
            for(int j = 2; j < fileInnerText.Length - 2; ++j)
            {
                var mat = reg.Match(fileInnerText[j]);
                if (!mat.Success)
                    continue; // line is too short for the fixed-width layout
                for(int k=1; k<=5; ++k)
                {
                    var key = new CCDDetail() { FieldId = k, Value = mat.Groups[k].Value };
                    // the same field and value may occur on multiple lines
                    List<int> lines;
                    if (!fileSpecText.TryGetValue(key, out lines))
                    {
                        lines = new List<int>();
                        fileSpecText.Add(key, lines);
                    }
                    lines.Add(j);
                }
            }
            textFileDictionaries.Add(fileSpecText);
        }
        // compare every pair of files (i, j) with i < j, including the last file
        for(int i=0; i<textFileDictionaries.Count - 1; ++i)
        {
            for (int j = i+1; j < textFileDictionaries.Count; ++j)
            {
                foreach(var tup in textFileDictionaries[j])
                {
                    if(textFileDictionaries[i].ContainsKey(tup.Key))
                    {
                        // the field value might occur on multiple lines
                        lineMatches.Add(new LineMatch() { 
                            File1Index=i,
                            File1Lines = textFileDictionaries[i][tup.Key],
                            File2Index=j,
                            File2Lines = textFileDictionaries[j][tup.Key]
                        });
                    }
                }
                /*
                for (int k = 0; k < textFileDictionaries[j].Count; ++k)
                {
                    var key = textFileDictionaries[j].Keys.ToArray()[k];
                    if (textFileDictionaries[i].ContainsKey(key))
                    {
                        // the field value might occur on multiple lines
                        lineMatches.Add(new LineMatch()
                        {
                            File1Index = i,
                            File1Lines = textFileDictionaries[i][key],
                            File2Index = j,
                            File2Lines = textFileDictionaries[j][key]
                        });
                    }
                }
               */
            }
        }
    }

....

    public class CCDDetail
    {
        public int FieldId { get; set; }
        public string Value { get; set; }

        public override bool Equals(object obj)
        {
            // null-safe: obj may be null or of a different type
            var other = obj as CCDDetail;
            return other != null && FieldId == other.FieldId && string.Equals(Value, other.Value);
        }
        public override int GetHashCode()
        {
            // combine both fields; tolerate a null Value
            unchecked
            {
                return (FieldId * 397) ^ (Value == null ? 0 : Value.GetHashCode());
            }
        }
    }
    public class LineMatch
    {
        public int File1Index { get; set; }
        public List<int> File1Lines { get; set; }
        public int File2Index { get; set; }
        public List<int> File2Lines { get; set; }
    }

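Keep in mind that I am assuming the same field value can occur on multiple lines in either of the files being compared. Also, the LineMatch list needs post-processing, because it contains a record of all the lines of the two files that share a field value (you may also want to record which field number matched).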
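As a rough illustration of the two suggestions above (load only one file into a dictionary, and record which field number matched), here is a minimal, untested-in-production sketch. The names `StreamingMatcher`, `BuildIndex` and `FindMatches` are hypothetical and not part of the original code. It assumes C# 7 or later (value tuples, `out var`), that the caller has already dropped the header and trailer lines, and that line numbers are relative to the sequence passed in.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StreamingMatcher
{
    // Same fixed-width layout as the regex used in the question's update.
    private static readonly Regex Layout =
        new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);

    // Index one file: (field number, field value) -> line numbers holding that value.
    public static Dictionary<(int Field, string Value), List<int>> BuildIndex(IEnumerable<string> lines)
    {
        var index = new Dictionary<(int Field, string Value), List<int>>();
        int lineNo = 0;
        foreach (var line in lines)
        {
            var m = Layout.Match(line);
            if (m.Success)
            {
                for (int f = 1; f <= 5; ++f)
                {
                    var key = (f, m.Groups[f].Value);
                    if (!index.TryGetValue(key, out var hits))
                        index[key] = hits = new List<int>();
                    hits.Add(lineNo);
                }
            }
            ++lineNo;
        }
        return index;
    }

    // Stream the other file against the index. The result maps each matching
    // (line in indexed file, line in streamed file) pair to the field numbers that matched.
    public static Dictionary<(int Line1, int Line2), List<int>> FindMatches(
        Dictionary<(int Field, string Value), List<int>> index,
        IEnumerable<string> otherLines)
    {
        var pairs = new Dictionary<(int Line1, int Line2), List<int>>();
        int lineNo = 0;
        foreach (var line in otherLines)
        {
            var m = Layout.Match(line);
            if (m.Success)
            {
                for (int f = 1; f <= 5; ++f)
                {
                    if (!index.TryGetValue((f, m.Groups[f].Value), out var hits))
                        continue; // no line in the indexed file has this field value
                    foreach (int line1 in hits)
                    {
                        var pair = (line1, lineNo);
                        if (!pairs.TryGetValue(pair, out var fields))
                            pairs[pair] = fields = new List<int>();
                        fields.Add(f);
                    }
                }
            }
            ++lineNo;
        }
        return pairs;
    }
}
```

Passing `File.ReadLines(path)` as `otherLines` would read the second file lazily, so only the indexed file is held in memory. Because each line pair appears once with its list of matching fields, no further de-duplication should be needed. Note that blank field values match each other here, just as in the original `Matches` function.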

Fastest way to search an array for matches in another array
