Question

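I have thousands of .log files, and I need to find some strings in all of them.

Let me explain with an example: every .log file contains a string called "AAA", and after that string there is a number that can differ from one log file to another. I know how to search for the AAA string. What I don't know is how to extract only the number that follows the AAA string.

Update: The .log files contain many lines. In each .log file, only one line contains the string "A12A". After that string there is a number (for example: 5465). What I need is to extract the number after A12A. Note: there is a space between A12A and the number 5465.

Example: .log file: "assddsf dfdfsd dfd A12A 5465 dffdsfsdf dfdf dfdf". What I need to extract: 5465.

What I have so far: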

using System;
using System.Collections.Generic;
using System.Linq;

class QueryContents
{
    public static void Main()
    {
        // Modify this path as necessary.
        string startFolder = @"c:\program files\Microsoft Visual Studio 9.0\";

        // Take a snapshot of the file system.
        System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);

        // This method assumes that the application has discovery permissions
        // for all folders under the specified path.
        IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories);

        string searchTerm = @"Visual Studio";

        // Search the contents of each file.
        // A regular expression created with the RegEx class
        // could be used instead of the Contains method.
        // queryMatchingFiles is an IEnumerable<string>.
        var queryMatchingFiles =
            from file in fileList
            where file.Extension == ".htm"
            let fileText = GetFileText(file.FullName)
            where fileText.Contains(searchTerm)
            select file.FullName;

        // Execute the query.
        Console.WriteLine("The term \"{0}\" was found in:", searchTerm);
        foreach (string filename in queryMatchingFiles)
        {
            Console.WriteLine(filename);
        }

        // Keep the console window open in debug mode.
        Console.WriteLine("Press any key to exit");
        Console.ReadKey();
    }

    // Read the contents of the file.
    static string GetFileText(string name)
    {
        string fileContents = String.Empty;

        // If the file has been deleted since we took
        // the snapshot, ignore it and return the empty string.
        if (System.IO.File.Exists(name))
        {
            fileContents = System.IO.File.ReadAllText(name);
        }
        return fileContents;
    }
}

Answer 1

I suggest using the following code for the search:

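// Required namespaces: System, System.Collections.Generic, System.IO,
// System.Text, System.Text.RegularExpressions.
private static readonly string _SearchPattern = "A12A";
private static readonly Regex _NumberExtractor = new Regex(@"\d+");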

private static IEnumerable<Tuple<String, int>> FindMatches()
{
    var startFolder = @"D:\";
    // The question is about .log files, so filter on that extension.
    var filePattern = @"*.log";
    var matchingFiles = Directory.EnumerateFiles(startFolder, filePattern, SearchOption.AllDirectories);

    foreach (var file in matchingFiles)
    {
        // What encoding do your files use?
        var lines = File.ReadLines(file, Encoding.UTF8);

        foreach (var line in lines)
        {
            int number;

            if (TryGetNumber(line, out number))
            {
                yield return Tuple.Create(file, number);

                // Stop searching that file and continue with the next one.
                break;
            }
        }
    }
}

private static bool TryGetNumber(string line, out int number)
{
    number = 0;

    // Should casing be ignored??
    var index = line.IndexOf(_SearchPattern, StringComparison.InvariantCultureIgnoreCase);

    if (index >= 0)
    {
        var numberRaw = line.Substring(index + _SearchPattern.Length);
        var match = _NumberExtractor.Match(numberRaw);
        return Int32.TryParse(match.Value, out number);
    }

    return false;
}

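The reason is that when doing I/O, the drive itself is usually the bottleneck. So it makes no sense to do anything in parallel, or to read large amounts of data from a file into memory without using it.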

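By using the Directory.EnumerateFiles method, the drive is searched lazily, and you can start checking the first file as soon as it has been found. The same goes for the File.ReadLines method: it iterates through the file lazily while you search for the pattern.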
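To make the lazy behaviour concrete, here is a minimal sketch. The class name LazyScanDemo, the helper FirstLineContaining, and the temporary sample files are made up for illustration and are not part of the answer's code. Because File.ReadLines streams the file, FirstOrDefault stops reading as soon as one line matches.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LazyScanDemo
{
    // Returns the first line in the file that contains the marker, or null.
    // File.ReadLines streams the file, so reading stops at the first match.
    public static string FirstLineContaining(string path, string marker)
    {
        return File.ReadLines(path)
                   .FirstOrDefault(line => line.Contains(marker));
    }

    public static void Main()
    {
        // Hypothetical scratch directory holding two small sample logs.
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        File.WriteAllLines(Path.Combine(dir, "a.log"),
            new[] { "noise", "dfd A12A 5465 dfdf", "more noise" });
        File.WriteAllLines(Path.Combine(dir, "b.log"),
            new[] { "nothing to see here" });

        // EnumerateFiles yields each path as it is discovered,
        // so the first file can be inspected before the scan finishes.
        foreach (string file in Directory.EnumerateFiles(dir, "*.log", SearchOption.AllDirectories))
        {
            string hit = FirstLineContaining(file, "A12A");
            Console.WriteLine("{0}: {1}", Path.GetFileName(file), hit ?? "(no match)");
        }

        // Clean up the scratch directory.
        Directory.Delete(dir, true);
    }
}
```

The order in which the two files are reported depends on the file system, so it should not be relied upon.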

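With this approach you should get close to the maximum speed (depending on your hard drive's performance), because it needs the fewest I/O calls to get the files and their contents into your code.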
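As a possible variation on the TryGetNumber helper, the marker and the number could be matched with a single regular expression, which also requires the digits to come directly after the marker and the whitespace. This is only a sketch: the class name MarkerNumber and the method TryExtract are hypothetical, and it assumes, as the question states, that the marker and the number are separated by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkerNumber
{
    // Matches the marker, at least one whitespace character, then a run of digits.
    // Case is ignored here, mirroring the IgnoreCase comparison in the answer.
    private static readonly Regex Pattern =
        new Regex(@"A12A\s+(\d+)", RegexOptions.IgnoreCase);

    // Returns true and sets number when the line contains the marker
    // followed by whitespace and digits that fit in an int.
    public static bool TryExtract(string line, out int number)
    {
        number = 0;
        Match match = Pattern.Match(line);
        return match.Success && int.TryParse(match.Groups[1].Value, out number);
    }

    public static void Main()
    {
        int number;
        // Sample line taken from the question.
        if (TryExtract("assddsf dfdfsd dfd A12A 5465 dffdsfsdf dfdf dfdf", out number))
        {
            Console.WriteLine(number); // prints 5465
        }
    }
}
```

Unlike the substring approach, this does not pick up a number that appears later on the line after other text, which may or may not be what you want for your logs.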

How do I extract a string from multiple txt files?