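RegEx to match multiple strings in a large text file

Time: 2014-08-16 18:04:53

Tags: c# .net regex

Question

I have a fairly large text file (about 10 megabytes, 700,000 lines) that contains HTML code.

My goal is to extract certain pieces of information from it. I believe RegEx is the best way to do this, since I have multiple files that need the same treatment.

I have what I think is a RegEx that matches the data I need, but I believe I'm running into a problem with anchors. I've been using regex101.com to help me match and learn RegEx, but I can only match one part of the data at a time. I've tried playing with \A, $, and ^ for the start and end of the string, with no luck. I tried googling this, but I found only one post that seemed to match my use case; it used Perl, and its solution was to build a single string from the entire text file, which I don't believe is a good idea.

Sample input file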

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type"  content="text/html; charset=ISO-8859-1">
<title></title>
</head>
<body dir="LTR" bgcolor="#ffffff">
<!-- Created by Oracle Reports 04:00 Fri Aug 15 04:00:37 AM, 2014 -->

<table border=0 cellspacing=0 cellpadding=0 width=774>
<tr><td width=15></td><td width=1></td><td width=3></td><td width=6></td><td width=44></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=11></td><td width=4></td><td width=11></td><td width=2></td><td width=13></td><td width=45></td><td width=1></td><td width=15></td><td width=3></td><td width=9></td><td width=8></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=12></td><td width=45></td><td width=1></td><td width=9></td><td width=6></td><td width=4></td><td width=16></td><td width=1></td><td width=11></td><td width=1></td><td width=13></td><td width=1></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=13></td><td width=36></td><td width=8></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=8></td><td width=1></td><td width=10></td><td width=25></td></tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
  <td height=9></td>
  <td colspan=23></td>
  <td colspan=2></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
  <td height=9></td>
  <td width=174 colspan=19 rowspan=2><font face="helvetica" color="#007f7f"><b>15-AUG-2014</b></font></td>
  <td colspan=38></td>
  <td width=139 colspan=16 rowspan=2 align=center> <font face="helvetica" color="#007f7f"><b>Page&nbsp;</b></font><font face="helvetica" color="#007f7f"><b>1</b></font><font face="helvetica" color="#007f7f"><b>&nbsp;of&nbsp;</b></font><font face="helvetica" color="#007f7f"><b>58</b></font><br></td>
  <td colspan=3></td>
</tr>
<tr valign=top>
  <td height=9></td>
  <td colspan=38></td>
  <td colspan=3></td>
</tr>
<tr valign=top>
  <td height=9 colspan=3></td>
  <td></td>
</tr>
<tr valign=top>
  <td height=9 colspan=3></td>
  <td></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
  <td height=9 colspan=2></td>
  <td></td>
</tr>
<tr valign=top>
  <td height=9 colspan=27></td>
  <td colspan=28></td>
</tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td width=44><font size=2 face="helvetica">08/14/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">7</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=17 colspan=3 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45><font size=2 face="helvetica">07/19/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45><font size=2 face="helvetica">06/23/14</font></td>
  <td></td>
  <td width=15 colspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=16 align=right><font size=2 face="helvetica">0</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">1</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td width=44 rowspan=2><font size=2 face="helvetica">08/14/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">07/19/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">06/23/14</font></td>
  <td></td>
  <td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=16 rowspan=2 align=right><font size=2 face="helvetica">7</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/28/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td width=44 rowspan=2><font size=2 face="helvetica">08/13/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">07/18/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">0</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">06/22/14</font></td>
  <td></td>
  <td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=16 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/27/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td colspan=4></td>
</tr>

Regular expression

Using the global and multiline modifiers:

\s*<td width=\d* rowspan=\d*><font size=\d face="helvetica">(?<Date>\d+.\d+.\d+)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">&nbsp;(?<Time>E|M)<.font><.td>
\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<FirstNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<SecondNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<ThirdNum>\d)<.font><.td>

C# source

static void Main(string[] args)
{

    string filePathDirty = @"DataBase/InputFile.htm";
    string filePathClean = @"DataBase/InputFile-CLEAN.htm";

    int totalLines = File.ReadAllLines(filePathDirty).Length;

    try
    {

        string[] lines = File.ReadAllLines(filePathDirty);
        string cleanLine;

        int progress = 0;

        string pattern = String.Empty;

            // Group Name: Date
            pattern += @"\s*<td width=\d* rowspan=\d*><font size=\d face=""helvetica"">(?<Date>\d+.\d+.\d+)<.font><.td>";
            // Group Name: Time
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">&nbsp;(?<Time>E|M)<.font><.td>";
            // Group Name: FirstNumber
            pattern += @"\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<FirstNum>\d)<.font><.td>";
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
            // Group Name: SecondNumber
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<SecondNum>\d)<.font><.td>";
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
            // Group Name: ThirdNumber
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<ThirdNum>\d)<.font><.td>";

        foreach (string line in lines)
        {
            // Skip the First 69 Lines, No Need to Since there is no Data
            if (progress > 69)
            {

                foreach (Match match in Regex.Matches(line, pattern))
                {
                        cleanLine = String.Format("{0} | {1} | {2} | {3} | {4}\r\n", match.Groups["Date"].Value, match.Groups["Time"].Value, match.Groups["FirstNum"].Value, match.Groups["SecondNum"].Value, match.Groups["ThirdNum"].Value);
                        WriteToFile(cleanLine, filePathClean);
                }

            }

            progress++;

        }

    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }

}
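
A note on the code above: the pattern describes seven table cells that sit on separate lines, but Regex.Matches is called on one line at a time, so no single line can ever satisfy it. The pattern also requires rowspan (and, for the first number, colspan) on every cell and does not allow for the empty <td></td> spacer between cells, both of which the sample input contradicts. The following is a minimal sketch, not the asker's code, that matches against the whole text with those requirements relaxed. The class name WholeTextMatcher and the assumption that the file fits comfortably in memory (about 10 MB here) are mine.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WholeTextMatcher
{
    // Any populated cell: attributes on <td> and <font> are allowed but not required.
    private const string Open = @"<td[^>]*><font[^>]*>";
    private const string Close = @"</font></td>";
    // The empty spacer cell that separates populated cells in the sample.
    private const string Gap = @"\s*<td></td>\s*";

    private static readonly Regex Record = new Regex(
        Open + @"(?<Date>\d{2}/\d{2}/\d{2})" + Close + Gap +
        Open + @"&nbsp;(?<Time>[EM])" + Close + Gap +
        Open + @"(?<FirstNum>\d)" + Close + Gap +
        Open + "-" + Close + Gap +
        Open + @"(?<SecondNum>\d)" + Close + Gap +
        Open + "-" + Close + Gap +
        Open + @"(?<ThirdNum>\d)" + Close,
        RegexOptions.Compiled);

    // Returns one "Date | Time | N1 | N2 | N3" string per record found in html.
    public static List<string> Extract(string html)
    {
        var records = new List<string>();
        foreach (Match m in Record.Matches(html))
        {
            records.Add(string.Join(" | ",
                m.Groups["Date"].Value,
                m.Groups["Time"].Value,
                m.Groups["FirstNum"].Value,
                m.Groups["SecondNum"].Value,
                m.Groups["ThirdNum"].Value));
        }
        return records;
    }
}
```

With this sketch, something like WholeTextMatcher.Extract(File.ReadAllText(filePathDirty)) would replace the per-line loop. No anchors are needed, because \s* already spans the line breaks between cells.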

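Simplified specification

Within the HTML there is a small subset of data that needs to be extracted. I have added comments to help identify where the data is and how it is formatted.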

<!-- Start Matching -->

<tr valign=top>
  <td height=9 colspan=4></td>

<!-- Line Below Has the Date // 08/14/14 -->

  <td width=44><font size=2 face="helvetica">08/14/14</font></td>
  <td></td>

<!-- Line Below Has the Time // E -->
<!-- Will Either be a Capital E or M for Evening or Morning -->

  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>

<!-- Line Below Has the First Number // 5 -->

  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>

<!-- Line Below Has the Second Number // 7 -->

  <td width=14 align=right><font size=2 face="helvetica">7</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>

<!-- Line Below Has the Third Number // 3 -->

  <td width=17 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=17 colspan=3 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>

<!-- End of Matching // There are Three Sets of Data per HTML Table Row -->

  <td width=45><font size=2 face="helvetica">07/19/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45><font size=2 face="helvetica">06/23/14</font></td>
  <td></td>
  <td width=15 colspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=16 align=right><font size=2 face="helvetica">0</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">1</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>

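I want to group these sets into the following format and create a new flat file, so the data can be imported cleanly into a database.

Date | Time | NumberOne | NumberTwo | NumberThree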
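
For the concern about loading the whole file into one string, here is a rough sketch of a line-by-line alternative that produces the format above. It is an illustration rather than a tested solution for the full report: the class name StreamingExtractor is hypothetical, and it assumes that every populated cell sits on its own line and that the four populated cells after a date are always the time and the three numbers, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StreamingExtractor
{
    // One populated cell on a single line; a leading &nbsp; is skipped.
    private static readonly Regex Cell = new Regex(
        @"<td[^>]*><font[^>]*>(?:&nbsp;)?(?<Text>[^<]*)</font></td>");
    private static readonly Regex Date = new Regex(@"^\d{2}/\d{2}/\d{2}$");

    // Yields "Date | Time | N1 | N2 | N3" lines while reading the input lazily.
    public static IEnumerable<string> Extract(IEnumerable<string> lines)
    {
        var fields = new List<string>(5);
        foreach (string line in lines)
        {
            Match m = Cell.Match(line);
            if (!m.Success) continue;

            string text = m.Groups["Text"].Value;
            if (text.Length == 0 || text == "-") continue; // padding cells and dash separators

            if (Date.IsMatch(text)) fields.Clear();        // a date always starts a new record
            else if (fields.Count == 0) continue;          // ignore text seen before any date

            fields.Add(text);
            if (fields.Count == 5)
            {
                yield return string.Join(" | ", fields);
                fields.Clear();
            }
        }
    }
}
```

Paired with File.ReadLines for input and File.WriteAllLines for output, this reads and writes one line at a time and opens the output file once, instead of once per match.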

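1 answer:

Answer 0: (score: 1)

Consider a different approach:

  1. First convert the HTML document / HTML table to XML (free tools and code are available to do this).
  2. Write your own XQuery / XML parsing code to pull out the details you want and do the rest. Hope this helps.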
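
The two steps above can be sketched roughly as follows for a single data row. This is only one possible reading of the answer, which names no specific tool: the conversion here is a naive text rewrite (quoting bare attribute values and replacing &nbsp;) that happens to work for the regular, machine-generated rows in the sample, and LINQ to XML stands in for XQuery. The class and method names are hypothetical. Rows containing unclosed tags such as <br> would need a real HTML-to-XML converter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlRowParser
{
    // Step 1: make one <tr>...</tr> fragment well-formed XML by quoting bare
    // attribute values (width=44 -> width="44") and replacing the HTML-only
    // &nbsp; entity with its numeric form.
    public static XElement ToXml(string htmlRow)
    {
        string quoted = Regex.Replace(htmlRow, @"(\w+)=([^\s""'>]+)", "$1=\"$2\"");
        return XElement.Parse(quoted.Replace("&nbsp;", "&#160;"));
    }

    // Step 2: query the cells and group them into pipe-delimited records.
    public static List<string> ExtractRecords(string htmlRow)
    {
        // Keep the text of populated cells only; Trim() also removes the
        // non-breaking space, so padding cells become empty and are dropped.
        List<string> cells = ToXml(htmlRow)
            .Elements("td")
            .Select(td => td.Value.Trim())
            .Where(text => text.Length > 0 && text != "-")
            .ToList();

        var records = new List<string>();
        for (int i = 0; i + 4 < cells.Count; i++)
        {
            // A date cell starts a record; the next four cells complete it.
            if (Regex.IsMatch(cells[i], @"^\d{2}/\d{2}/\d{2}$"))
            {
                records.Add(string.Join(" | ", cells.Skip(i).Take(5)));
                i += 4;
            }
        }
        return records;
    }
}
```

Splitting the work this way keeps the fragile part (turning loose HTML into XML) separate from the extraction logic, so the first step can be swapped for a proper converter without touching the second.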