How do you extract some data from an HTML file in R

Time: 2015-01-25 21:31:28

Tags: regex r

I have an HTML file, and I want to extract some data from it and build a vector.

My HTML file looks like this:

data

 "<HTML>\r\n<HEAD>\r\n<meta http-equiv=\"Expires\" content=\"0\"/>\n<meta http-equiv=\"Pragma\" content=\"no-cache\"/>\n\r\n<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html;CHARSET=Cp1252\"/>\r\n\r\n<TITLE>file</TITLE>\r\n<LINK REL=\"stylesheet\" TYPE=\"text/css\" HREF=\"/SiteScope/htdocs/artwork/sitescopeUI.css\"/>\r\n</HEAD>\n\r\n<BODY BGCOLOR=\"#ffffff\" LINK=#1155bb ALINK=#1155bb VLINK=#1155bb>\n\r\n<H2></H2><p><p>\r\n<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/latest.html><B>Most Recent Report</B></A>\r\n<P><CENTER>\n<A NAME=uptimeSummary> </A>\n<TABLE WIDTH=\"100%\" BORDER=1 CELLSPACING=0>\n <CAPTION><B>Report Summary</B></CAPTION>\r\n <TR BGCOLOR=\"#88AA99\"><TH>&nbsp;</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag1</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag10</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag11</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag12</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag13</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag14</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag15</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag16</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag2</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server1</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag4</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server2</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag6</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server3</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag8</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server9</TH><TH COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on dc2prwtag17</TH><TH 
COLSPAN=2>WTAD::Linux: Total CPU Percent Utilization on server10</TH></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><B>Information For</B></TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD><TD ALIGN=RIGHT>avg</TD><TD ALIGN=RIGHT>peak</TD></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.html>3:33 PM 1/18/15 - 3:33 PM 1/25/15</A> (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt>text</A>)</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.67%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">28%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.85%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">10%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.65%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.54%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">14%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.12%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">15%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.42%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.72%</TD><TD ALIGN=RIGHT 
BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.26%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">30%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.42%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.4%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.58%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.46%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.4%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.25%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">8%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4%</TD></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.html>1:05 AM 1/18/15 - 1:05 AM 1/25/15</A> (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.txt>text</A>)</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.68%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">28%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.75%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">10%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.41%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">14%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">15%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.39%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">18%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.72%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.25%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">30%</TD><TD 
ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.43%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.39%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.58%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.46%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.4%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.17%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">8%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.55%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4%</TD></TR>\r\n <TR BGCOLOR=\"#DDDDDD\"><TD><A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.html>11:26 AM 1/13/15 - 11:26 AM 1/20/15</A> (<A HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt>text</A>)</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.83%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">27%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.74%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">15%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.51%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.64%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">20%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.32%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">21%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.84%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">20%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.72%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4.39%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">27%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">3.49%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT 
BGCOLOR=\"#FFFFFF\">3.45%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">16%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.65%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.51%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.42%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">6%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1.55%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">7%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">2.11%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">8%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">1%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">0.61%</TD><TD ALIGN=RIGHT BGCOLOR=\"#FFFFFF\">4%</TD></TR>\r\n</TABLE></CENTER>\r\n<P><FORM ACTION=\"/SiteScope/cgi/go.exe/SiteScope\" method=\"POST\">\n<input type=\"hidden\" name=\"page\" value=\"adhocReport\"/>\n<input type=\"hidden\" name=\"queryID\" value=\"1725002550\"/>\n<input type=\"hidden\" name=\"htmlFile\" value=\"yes\"/>\n<input type=\"hidden\" name=\"account\" value=\"login59\"/>\n<input type=\"hidden\" name=\"isFlipperContext\" value=\"false\"/>\n<input type=\"hidden\" name=\"isSwingContext\" value=\"true\"/>\n<input type=\"hidden\" name=\"locale\" value=\"en_US\"/>\n<input type=\"hidden\" name=\"useOldLinks\" value=\"false\"/>\n<input class=\"button\" type=\"submit\" value=\"Generate\" onclick=\"this.disabled=true; this.value= 'Generating. Wait..'; document.forms[0].submit();\" />\n</FORM>\nManagement Report Now - this will immediately generate and save this report, using the most current data\n (<B>Note: </B>This may take a few moments, depending on the speed of the SiteScope machine, the number of monitors and the time period of the report)\n</BODY></HTML>\r\n"

I need to grep the strings that start with HREF and end with >. For example, I need to put all of these entries into a vector:

HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt
HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.txt
HREF=/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt

Then I want to sort the vector and list the most recent entry in it.

I tried this:

vec<-as.vector()
vec<-append(grepl("(HREF.*>?"),data, value=TRUE)

No luck. I would appreciate any guidance on this.

1 Answer:

Answer 0: (score: 1)

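I would suggest never using grep on HTML. Instead, you can use the XML package to parse and search the HTML. I couldn't reproduce your HTML document, but since you mentioned in the comments that this works, I'll post my comment here as an answer. Starting from your data string: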
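library(XML)
# Parse the HTML held in the character string `data` (asText=TRUE: it is text, not a file name)
doc <- htmlParse(data, asText=TRUE, useInternalNodes=TRUE)
# Select every node that has an href attribute and collect its attributes
x <- xpathSApply(doc, "//*[@href]", xmlAttrs)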
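A side note on the call above: xmlAttrs returns every attribute of each matched node, so an element such as the LINK tag contributes its REL and TYPE values as well. The sketch below is one possible variant, not part of the original answer. It asks for the href value only, using xmlGetAttr from the same XML package. The html string is a made-up miniature of the report page, and the names html, hrefs and txt_links are illustrative.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical miniature of the report page, not the asker's full document
html <- paste0(
  "<HTML><HEAD><LINK REL='stylesheet' TYPE='text/css' HREF='/x/site.css'/></HEAD>",
  "<BODY><A HREF='/r/Report-15_33-01_25_2015.html'>report</A> ",
  "(<A HREF='/r/Report-15_33-01_25_2015.txt'>text</A>)</BODY></HTML>"
)

doc <- htmlParse(html, asText = TRUE)

# The HTML parser lower-cases tag and attribute names, so "a" and "href" match A and HREF.
# xmlGetAttr returns a single attribute value per node, giving a plain character vector.
hrefs <- xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")

# Keep only the links that end in .txt
txt_links <- grep("[.]txt$", hrefs, value = TRUE)
print(txt_links)
```

Restricting the XPath to a elements also leaves out the stylesheet link in the page head.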

Now we have all the href links. From there you can use

grep("[.]txt$", x, value=TRUE)

to keep only those ending in .txt.
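The question also asks for the vector to be sorted and the most recent entry listed, which the steps above do not cover. Here is a minimal base-R sketch of one way to do it. It assumes the file names always follow the pattern Report-HH_MM-MM_DD_YYYY.txt, as the three links in the question do. The variable names links, stamps, sorted_links and latest are illustrative, and the UTC time zone is an arbitrary choice used only for comparison.

```r
# The three .txt links that appear in the question's HTML
links <- c(
  "/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-15_33-01_25_2015.txt",
  "/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-01_05-01_25_2015.txt",
  "/SiteScope/accounts/login59/htdocs/Reports-1725002550/Report-11_26-01_20_2015.txt"
)

# Assumption: each file name encodes hour_minute-month_day_year.
# strptime matches the literal parts ("Report-", "_", "-", ".txt") as written.
stamps <- as.POSIXct(
  strptime(basename(links), format = "Report-%H_%M-%m_%d_%Y.txt", tz = "UTC")
)

# Oldest to newest, ordered by the parsed timestamp rather than alphabetically
sorted_links <- links[order(stamps)]

# The most recent report
latest <- links[which.max(as.numeric(stamps))]
print(latest)
```

A plain sort() on the strings would compare the hour field first, because it comes before the date in the file name, so parsing the timestamp is the safer route.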