Locating LexisNexis metadata using XPath in R

Asked: 2016-01-07 13:42:42

Tags: html regex xml r xpath

I'm not very familiar with navigating XML/HTML using XPath or regex, and I'm working with a set of HTML documents from LexisNexis in the following format:

<HTML>
    <HEAD>
        <STYLE TYPE="text/css"><!--
        .c0 { text-align: center; }
        .c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; }
        .c2 { font-family: 'Times New Roman'; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
        .c3 { text-align: center; margin-left: 13%; margin-right: 13%; }
        .c4 { text-align: left; }
        .c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; }
        .c6 { font-family: 'Times New Roman'; font-size: 14pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
        .c7 { font-family: 'Times New Roman'; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
        .c8 { text-align: left; margin-top: 1em; margin-bottom: 0em; }
        .c9 { page-break-before: always; }
        .c10 { font-family: 'Times New Roman'; font-size: 10pt; font-style: italic; font-weight: normal; color: #000000; text-decoration: none; }
        .c11 { border-collapse: collapse; table-layout: auto; width:100%; }
        .c12 { width: 480pt; }
        .c13 { text-align: left; padding-left: 2pt; vertical-align: top; padding-right: 2pt; }
        .c14 { font-family: 'Courier New',Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
        .c15 { width: 120pt; }
        .c16 { text-align: right; padding-left: 2pt; vertical-align: top; padding-right: 2pt; }
        .c17 { text-align: right; margin-top: 0em; margin-bottom: 0em; }
        .c18 { text-align: center; margin-left: 5%; margin-right: 5%; }
        .c19 { margin-left: 30pt; margin-right: 0pt; margin-top: 0em; margin-bottom: 0em; list-style: none; }
        .c20 { margin-left: 0pt; margin-right: 0pt; }
        .c21 { margin-top: 0em; margin-bottom: 0em; }
        .c22 { text-align: left; margin-left: 30pt; margin-top: -12pt; }
        --></STYLE>
        <!-- LXNComment 2826:543743167 -->
        <TITLE>&nbsp;</TITLE>
        <META TOPIC="null" DOCUMENTS="500" UPDATED="Tuesday, January 05, 2016  18:08:34 EST" /></HEAD>
        <BODY>
<A NAME="DOC_ID_0_0"></A><!-- Hide XML section from browser
<DOC NUMBER=1>
    <DOCFULL> -->
        <BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 1301 DOCUMENTS</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Lincoln Journal Star (Nebraska)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">August 2, 2001 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>City Edition</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Class counts, not race</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">BYLINE: </SPAN><SPAN CLASS="c2">BUTCH MABIN, Lincoln Journal Star</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">SECTION: </SPAN><SPAN CLASS="c2">A; Pg. 1</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LENGTH: </SPAN><SPAN CLASS="c2">1779 words</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">DATELINE: </SPAN><SPAN CLASS="c2">Lincoln, NE </SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c8"><SPAN CLASS="c2">Study says geography plays role </SPAN></P>
            <P CLASS="c8"><SPAN CLASS="c2">  The battle lines dividing both sides of the death penalty debate came into sharp focus with Wednesday's release of a comprehensive study examining the fairness of capital punishment in Nebraska. (cut out the remaining body of text)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN><SPAN CLASS="c2">August 11, 2005</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">GRAPHIC: </SPAN><SPAN CLASS="c2">A divided time: The Sept. 2, 1994, execution of Harold Otey (above and below) drew more than 1,000 spectators to the Nebraska State Penitentiary - many of them with sharply opposing views of capital punishments. JOURNAL STAR FILE PHOTOS (one photo archived) 3 b/w head photos of Harold Otey, John Joubert and Robert Williams. (photo of Williams not archived)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2001 Lincoln Journal Star,<BR>All Rights Reserved</SPAN></P>
        </DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c9">&nbsp;</DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
    <DOCFULL> -->
        <BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 1301 DOCUMENTS</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Lincoln Journal Star (Nebraska)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">February 8, 2004 Sunday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>City Edition</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">Death penalty at crossroads</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">BYLINE: </SPAN><SPAN CLASS="c2">JOE DUGGAN, LINCOLN JOURNAL STAR</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">SECTION: </SPAN><SPAN CLASS="c2">A; Pg. 1</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LENGTH: </SPAN><SPAN CLASS="c2">2493 words</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">DATELINE: </SPAN><SPAN CLASS="c2">LINCOLN, NE </SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c8"><SPAN CLASS="c2">A legislative bill on lethal injection, U.S. Supreme Court caseand constitutional appeals may affect the future of Nebraska's seven death-row inmates. (cut out the remaining body of text)</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LOAD-DATE: </SPAN><SPAN CLASS="c2">July 13, 2007</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">GRAPHIC: </SPAN><SPAN CLASS="c2">1. Nebraska is the only state in the nation to have the electric chair as the sole means of execution, and some wonder whether the law would survive an Eighth Amendment challenge that it is cruel and unusual punishment. 2. Seven inmates are in death row at the Nebraska State Correctional Institution in Tecumseh. 3. Marylyn Felion's portrait of Robert E. Williams, who was executed in 1997. 7 color head photos and statistics of Carey Dean Moore, Charles Jess Palmer, Michael Ryan, John Lotter, David Dunster, Raymond Mata Jr. and Arthur Lee Gales. color head photo of Summerlin JOURNAL STAR FILE PHOTO</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c7">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
        </DIV>
        <BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2004 Lincoln Journal Star,<BR>All Rights Reserved</SPAN></P>
        </DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
</BODY></HTML>

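Now I would like to extract the date of each document, and I tried to follow the guidelines given in this now closed question. However, those suggestions seem to rely on a label (such as "SECTION:"), and the only labelled date I have is "LOAD-DATE:" (which is not always the same as the actual date shown above the headline). Even so, trying the suggested expression, as shown below, returns nothing: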

> ex <- htmlTreeParse("~/Desktop/example.html", encoding="UTF-8")
> example <- xmlRoot(ex)
> xpathSApply(example, "//DOCFULL/*/*/span[text()='SECTION: ']/..", xmlValue)
NULL

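How can I fix this expression to extract the load date or, even better, the actual date of each document?

Is it possible for the expression to account for documents with missing dates (i.e. mark them with NA)?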

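1 answer:

Answer 0 (score: 0)

Just drop DOCFULL/* and simplify the XPath. The DOCFULL tags only appear inside HTML comments ("Hide XML section from browser"), so they are not elements in the parsed tree and the original expression can never match: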

xpathSApply(example, "//span[text()='SECTION: ']/..", xmlValue)
[1] "SECTION: A; Pg. 1" "SECTION: A; Pg. 1"
xpathSApply(example, "//div[@class='c3']/p[@class='c1']/span[@class='c2'][1]", xmlValue)
[1] "August 2, 2001 Thursday" "February 8, 2004 Sunday"
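
The same pattern can be pointed at the "LOAD-DATE:" label from the question. The following is only a sketch on a cut-down copy of the sample markup, not a tested solution for the full export: the inline `html` string and the variable names are made up for illustration, and `as.Date()` with `%B` depends on the `LC_TIME` locale using English month names.

```r
library(XML)

# Cut-down stand-in for the LexisNexis export (illustrative only)
html <- '<html><body>
<div class="c3"><p class="c1"><span class="c2">August 2, 2001 Thursday</span></p></div>
<div class="c4"><p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">August 11, 2005</span></p></div>
<div class="c3"><p class="c1"><span class="c2">February 8, 2004 Sunday</span></p></div>
<div class="c4"><p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">July 13, 2007</span></p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Value span that directly follows the "LOAD-DATE:" label span;
# normalize-space() makes the match insensitive to the trailing blank
load_raw <- xpathSApply(
  doc,
  "//span[normalize-space(.) = 'LOAD-DATE:']/following-sibling::span[1]",
  xmlValue
)

# Date line above the headline, as in the answer's second expression
pub_raw <- xpathSApply(
  doc,
  "//div[@class='c3']/p[@class='c1']/span[@class='c2'][1]",
  xmlValue
)
pub_raw <- sub("\\s+\\w+$", "", pub_raw)  # drop the trailing weekday

# Convert to Date; month-name parsing is locale dependent
load_date <- as.Date(load_raw, format = "%B %d, %Y")
pub_date  <- as.Date(pub_raw,  format = "%B %d, %Y")
```

Both expressions return one value per match, so they silently skip documents that lack the field; the results cannot be lined up with documents unless every document has one.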

If a node is missing a tag, there are lots of ways to add NA; this is a common question.
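
One possible way to get an NA per document, sketched here under the assumption that every document starts with an `<a name="DOC_ID_...">` anchor as in the sample: select the anchors and the "LOAD-DATE:" paragraphs with a single location step (so they come back in document order), then use the running count of anchors as the document index. The inline `html` string and all variable names are hypothetical.

```r
library(XML)

# Three toy documents; the second one has no LOAD-DATE field
html <- '<html><body>
<a name="DOC_ID_0_0"></a>
<div class="c4"><p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">August 11, 2005</span></p></div>
<a name="DOC_ID_0_1"></a>
<div class="c4"><p class="c5"><span class="c7">LANGUAGE: </span><span class="c2">ENGLISH</span></p></div>
<a name="DOC_ID_0_2"></a>
<div class="c4"><p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">July 13, 2007</span></p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Document anchors and LOAD-DATE paragraphs, in document order
nodes <- getNodeSet(
  doc,
  "//*[self::a[starts-with(@name, 'DOC_ID_')] or self::p[span[normalize-space(.) = 'LOAD-DATE:']]]"
)
kind <- sapply(nodes, xmlName)   # "a" for anchors, "p" for date paragraphs

is_anchor <- kind == "a"
doc_index <- cumsum(is_anchor)   # which document each node belongs to

# Start with NA for every document, then fill in the dates that exist
load_date <- rep(NA_character_, sum(is_anchor))
is_date <- kind == "p"
load_date[doc_index[is_date]] <- sapply(
  nodes[is_date],
  function(p) sub("^LOAD-DATE:\\s*", "", xmlValue(p))  # strip the label
)

print(load_date)
```

If a document contained more than one LOAD-DATE paragraph, the last one would overwrite the earlier ones in this sketch.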