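I have a fairly large text file (about 10 MB, 700,000 lines) that contains HTML code.
My goal is to extract certain pieces of information from it. I believe a regular expression is the best approach, because I have several more files that need the same treatment.
I have a regex that I think matches the data I need, but I believe I am running into a problem with anchors. I have been using regex101.com to help me match and learn regex, but I can only match one part of the data at a time. I have tried playing with \A, $ and ^ to mark the start and end of the string, with no luck. I also searched Google, but found only one article that seemed to match my use case; it used Perl, and its solution was to read the entire text file into a single string, which I do not believe is a good idea.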
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title></title>
</head>
<body dir="LTR" bgcolor="#ffffff">
<!-- Created by Oracle Reports 04:00 Fri Aug 15 04:00:37 AM, 2014 -->
<table border=0 cellspacing=0 cellpadding=0 width=774>
<tr><td width=15></td><td width=1></td><td width=3></td><td width=6></td><td width=44></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=11></td><td width=4></td><td width=11></td><td width=2></td><td width=13></td><td width=45></td><td width=1></td><td width=15></td><td width=3></td><td width=9></td><td width=8></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=12></td><td width=45></td><td width=1></td><td width=9></td><td width=6></td><td width=4></td><td width=16></td><td width=1></td><td width=11></td><td width=1></td><td width=13></td><td width=1></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=13></td><td width=36></td><td width=8></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=8></td><td width=1></td><td width=10></td><td width=25></td></tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
<td height=9></td>
<td colspan=23></td>
<td colspan=2></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
<td height=9></td>
<td width=174 colspan=19 rowspan=2><font face="helvetica" color="#007f7f"><b>15-AUG-2014</b></font></td>
<td colspan=38></td>
<td width=139 colspan=16 rowspan=2 align=center> <font face="helvetica" color="#007f7f"><b>Page </b></font><font face="helvetica" color="#007f7f"><b>1</b></font><font face="helvetica" color="#007f7f"><b> of </b></font><font face="helvetica" color="#007f7f"><b>58</b></font><br></td>
<td colspan=3></td>
</tr>
<tr valign=top>
<td height=9></td>
<td colspan=38></td>
<td colspan=3></td>
</tr>
<tr valign=top>
<td height=9 colspan=3></td>
<td></td>
</tr>
<tr valign=top>
<td height=9 colspan=3></td>
<td></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
<td height=9 colspan=2></td>
<td></td>
</tr>
<tr valign=top>
<td height=9 colspan=27></td>
<td colspan=28></td>
</tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td width=44><font size=2 face="helvetica">08/14/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">7</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=17 colspan=3 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45><font size=2 face="helvetica">07/19/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45><font size=2 face="helvetica">06/23/14</font></td>
<td></td>
<td width=15 colspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=16 align=right><font size=2 face="helvetica">0</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">1</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td width=44 rowspan=2><font size=2 face="helvetica">08/14/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">07/19/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">06/23/14</font></td>
<td></td>
<td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=16 rowspan=2 align=right><font size=2 face="helvetica">7</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/28/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td width=44 rowspan=2><font size=2 face="helvetica">08/13/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">07/18/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">0</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">06/22/14</font></td>
<td></td>
<td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=16 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/27/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td colspan=4></td>
</tr>
The regex, used with the global and multiline modifiers:
\s*<td width=\d* rowspan=\d*><font size=\d face="helvetica">(?<Date>\d+.\d+.\d+)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica"> (?<Time>E|M)<.font><.td>
\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<FirstNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<SecondNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<ThirdNum>\d)<.font><.td>
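One thing worth noting about the pattern above: it requires `rowspan=` on every cell and allows nothing but whitespace between cells, while the sample rows contain an empty `<td></td>` spacer between data cells and some rows have no `rowspan` at all. Below is a minimal sketch of a more tolerant pattern. The names `Cell`, `Gap` and `Record` are hypothetical, and it assumes the spacer cells are always exactly `<td></td>`; it has only been checked against the sample shown here, not the full file.

```csharp
using System;
using System.Text.RegularExpressions;

class RecordPatternSketch
{
    // One data cell: any attributes on <td> and <font>, optional leading
    // whitespace, then the captured text (hypothetical helper).
    static string Cell(string inner) =>
        @"<td\b[^>]*><font\b[^>]*>\s*" + inner + @"</font></td>";

    // Whitespace plus the optional empty <td></td> spacer between data cells.
    const string Gap = @"\s*(?:<td></td>\s*)?";

    static readonly Regex Record = new Regex(
        Cell(@"(?<Date>\d{2}/\d{2}/\d{2})") + Gap +
        Cell(@"(?<Time>[EM])") + Gap +
        Cell(@"(?<FirstNum>\d)") + Gap +
        Cell("-") + Gap +
        Cell(@"(?<SecondNum>\d)") + Gap +
        Cell("-") + Gap +
        Cell(@"(?<ThirdNum>\d)"));

    static void Main()
    {
        // First record of the sample row, copied from the HTML above.
        string html = @"<td width=44><font size=2 face=""helvetica"">08/14/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face=""helvetica""> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face=""helvetica"">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face=""helvetica"">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face=""helvetica"">7</font></td>
<td></td>
<td width=11 align=right><font size=2 face=""helvetica"">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face=""helvetica"">3</font></td>";

        foreach (Match m in Record.Matches(html))
        {
            // Prints: 08/14/14 | E | 5 | 7 | 3
            Console.WriteLine(string.Join(" | ",
                m.Groups["Date"].Value, m.Groups["Time"].Value,
                m.Groups["FirstNum"].Value, m.Groups["SecondNum"].Value,
                m.Groups["ThirdNum"].Value));
        }
    }
}
```

Because the attributes are matched with `[^>]*`, the same pattern should cover cells with or without `rowspan` and `colspan`. The trade-off is that it is less strict about what counts as a data cell.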
static void Main(string[] args)
{
    string filePathDirty = @"DataBase/InputFile.htm";
    string filePathClean = @"DataBase/InputFile-CLEAN.htm";
    int totalLines = File.ReadAllLines(filePathDirty).Length;
    try
    {
        string[] lines = File.ReadAllLines(filePathDirty);
        string cleanLine;
        int progress = 0;
        string pattern = String.Empty;
        // Group Name: Date
        pattern += @"\s*<td width=\d* rowspan=\d*><font size=\d face=""helvetica"">(?<Date>\d+.\d+.\d+)<.font><.td>";
        // Group Name: Time
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica""> (?<Time>E|M)<.font><.td>";
        // Group Name: FirstNumber
        pattern += @"\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<FirstNum>\d)<.font><.td>";
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
        // Group Name: SecondNumber
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<SecondNum>\d)<.font><.td>";
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
        // Group Name: ThirdNumber
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<ThirdNum>\d)<.font><.td>";
        foreach (string line in lines)
        {
            // Skip the leading lines of the file; they contain no data
            if (progress > 69)
            {
                foreach (Match match in Regex.Matches(line, pattern))
                {
                    cleanLine = String.Format("{0} | {1} | {2} | {3} | {4}\r\n", match.Groups["Date"].Value, match.Groups["Time"].Value, match.Groups["FirstNum"].Value, match.Groups["SecondNum"].Value, match.Groups["ThirdNum"].Value);
                    WriteToFile(cleanLine, filePathClean);
                }
            }
            progress++;
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
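The loop above hands the regex one line at a time, so a pattern that spans several lines can never match, whatever anchors are used. One possible middle ground between per-line matching and loading the whole file into a single string is to buffer lines until a closing `</tr>` and match against each buffered row. The sketch below is illustrative only: `ReadRows` and `DateCell` are hypothetical names, the pattern is cut down to the date cell to keep it short, and it assumes every table row ends on a line containing `</tr>`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

class RowBufferSketch
{
    // Simplified, hypothetical pattern: captures only the date cell.
    static readonly Regex DateCell = new Regex(
        @"<td\b[^>]*><font\b[^>]*>\s*(?<Date>\d{2}/\d{2}/\d{2})</font></td>");

    // Lazily yields one string per block of lines ending in </tr>,
    // so only a single table row is held in memory at a time.
    static IEnumerable<string> ReadRows(TextReader reader)
    {
        var row = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            row.AppendLine(line);
            if (line.Contains("</tr>"))
            {
                yield return row.ToString();
                row.Clear();
            }
        }
        // Emit any trailing lines that were not closed by </tr>.
        if (row.Length > 0) yield return row.ToString();
    }

    static void Main()
    {
        string sample =
            "<tr valign=top>\n" +
            "<td width=44><font size=2 face=\"helvetica\">08/14/14</font></td>\n" +
            "<td></td>\n" +
            "<td width=45><font size=2 face=\"helvetica\">07/19/14</font></td>\n" +
            "</tr>\n";

        // For a real file, a StreamReader over the input path would replace the StringReader.
        using (var reader = new StringReader(sample))
        {
            foreach (string row in ReadRows(reader))
                foreach (Match m in DateCell.Matches(row))
                    Console.WriteLine(m.Groups["Date"].Value);
        }
    }
}
```

With the sample above this prints the two dates, 08/14/14 and 07/19/14, one per line. Each row is matched as a multi-line string, so the same idea should carry over to a full record pattern.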
Only a small part of the HTML needs to be extracted. I have added comments to help show where the data is and how it is formatted.
<!-- Start Matching -->
<tr valign=top>
<td height=9 colspan=4></td>
<!-- Line Below Has the Date // 08/14/14 -->
<td width=44><font size=2 face="helvetica">08/14/14</font></td>
<td></td>
<!-- Line Below Has the Time // E -->
<!-- Will Either be a Capital E or M for Evening or Morning -->
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<!-- Line Below Has the First Number // 5 -->
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<!-- Line Below Has the Second Number // 7 -->
<td width=14 align=right><font size=2 face="helvetica">7</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<!-- Line Below Has the Third Number // 3 -->
<td width=17 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=17 colspan=3 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<!-- End of Matching // There are Three Sets of Data per HTML Table Row -->
<td width=45><font size=2 face="helvetica">07/19/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45><font size=2 face="helvetica">06/23/14</font></td>
<td></td>
<td width=15 colspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=16 align=right><font size=2 face="helvetica">0</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">1</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
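I want to group these sets into the following format and write them to a new flat file that can be imported cleanly into a database.
Date | Time | NumberOne | NumberTwo | NumberThree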
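To make the target format concrete, here is a minimal sketch of writing records in that layout through a single `StreamWriter`. `FormatRecord` and the output path are hypothetical, and the two records are taken from the sample row above rather than from a real run.

```csharp
using System;
using System.IO;

class FlatFileSketch
{
    // Formats one record as "Date | Time | NumberOne | NumberTwo | NumberThree".
    static string FormatRecord(string date, string time, string n1, string n2, string n3) =>
        string.Join(" | ", date, time, n1, n2, n3);

    static void Main()
    {
        // Hypothetical output location, used only for this sketch.
        string outputPath = Path.Combine(Path.GetTempPath(), "records-clean.txt");

        // One writer for the whole run, instead of reopening the file per record.
        using (var writer = new StreamWriter(outputPath))
        {
            writer.WriteLine(FormatRecord("08/14/14", "E", "5", "7", "3"));
            writer.WriteLine(FormatRecord("07/19/14", "E", "9", "2", "4"));
        }

        // Echo the file back to check the layout.
        Console.Write(File.ReadAllText(outputPath));
    }
}
```

Keeping the formatting in one small function means the delimiter can be changed in one place if the database import expects something other than ` | `.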
Answer 0 (score: 1)
Consider another approach..