Extracting and mapping multiple XML nodes in R with xpathSApply using a conditional statement

Time: 2015-09-11 16:15:21

Tags: r xml-parsing

I retrieved an XML file from a website using this code:

library(XML)
abstract <- xmlParse(file = 'http://ieeexplore.ieee.org/gateway/ipsSearch.jsp?querytext=%28systematic%20review%20OR%20systematic%20literature%20review%20AND%20text%20mining%20techniques%29&pys=2009&&hc=1000', isURL = TRUE)

The returned XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<root>

<totalfound>40420</totalfound>

<totalsearched>3735435</totalsearched>

<document>

<rank>1</rank>

<title><![CDATA[Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics]]></title>

<authors><![CDATA[Ghose, A.;  Ipeirotis, P.G.]]></authors>

<affiliations><![CDATA[Dept. of Inf., Oper., & Manage. Sci., New York Univ., New York, NY, USA]]></affiliations>

<controlledterms>

<term><![CDATA[Internet]]></term>

<term><![CDATA[data mining]]></term>

<term><![CDATA[electronic commerce]]></term>

<term><![CDATA[pattern classification]]></term>

</controlledterms>

<thesaurusterms>

<term><![CDATA[Communities]]></term>

<term><![CDATA[Economics]]></term>

<term><![CDATA[History]]></term>

<term><![CDATA[Marketing and sales]]></term>

<term><![CDATA[Measurement]]></term>

</thesaurusterms>

<pubtitle><![CDATA[Knowledge and Data Engineering, IEEE Transactions on]]></pubtitle>

<punumber><![CDATA[69]]></punumber>

<pubtype><![CDATA[Journals & Magazines]]></pubtype>

<publisher><![CDATA[IEEE]]></publisher>

<volume><![CDATA[23]]></volume>

<issue><![CDATA[10]]></issue>

<py><![CDATA[2011]]></py>

<spage><![CDATA[1498]]></spage>

<epage><![CDATA[1512]]></epage>

<abstract><![CDATA[With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we reexamine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes such as their perceived usefulness. Our approach explores multiple aspects of review text, such as subjectivity levels, various measures of readability and extent of spelling errors to identify important text-based features. In addition, we also examine multiple reviewer-level features such as average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective, and highly subjective sentences are negatively associated with product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are rated more informative (or helpful) by other users. By using Random Forest-based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. We examine the relative importance of the three broad feature categories: &#x201C;reviewer-related&#x201D; features, &#x201C;review subjectivity&#x201D; features, and &#x201C;review readability&#x201D; features, and find that using any of the three feature sets results in a statistically equivalent performance as in the case of using all available features. 
This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.]]></abstract>

<issn><![CDATA[1041-4347]]></issn>

<htmlFlag><![CDATA[1]]></htmlFlag>

<arnumber><![CDATA[5590249]]></arnumber>

<doi><![CDATA[10.1109/TKDE.2010.188]]></doi>

<publicationId><![CDATA[5590249]]></publicationId>

<partnum><![CDATA[5590249]]></partnum>

<mdurl><![CDATA[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=5590249&contentType=Journals+%26+Magazines]]></mdurl>

<pdf><![CDATA[http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5590249]]></pdf>

</document>

I want to extract the titles and match them with their authors. I used xpathSApply and getNodeSet on "//title" and "//authors":

getNodeSet(abstract, "//title")
getNodeSet(abstract, "//authors")
titlenodes <- xpathSApply(abstract, "//title")
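Then I found that some documents have no title. So if I extract titles and authors separately, I cannot match each title to its corresponding authors. I need a way to detect which documents have no title and, for those documents, select only the authors and return NA as the title.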

2 answers:

Answer 0: (score: 1)

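Consider importing all of the XML content under the parent node, document, into a data frame. That way you can see which rows are missing a title and/or authors.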

# one row per <document>, one column per child element
xmldf <- xmlToDataFrame(nodes = getNodeSet(abstract, "//document"))

# subset data frame of only title and author (to see NAs)
titleauthorsdf <- xmldf[, c("title", "authors")]

# authors of the documents that have no title
notitleauthorslist <- xmldf$authors[is.na(xmldf$title)]
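
As a small illustration of the approach above, the sketch below runs the same steps on a hypothetical two-record sample. The `sample_xml` string and the title "Paper A" are invented for the example and are not part of the IEEE response. It assumes that xmlToDataFrame fills absent child elements with NA when the records differ in their fields, which is the behavior this answer relies on.

```r
library(XML)

# Hypothetical sample: the second <document> has no <title>.
sample_xml <- '<root>
  <document><rank>1</rank><title>Paper A</title><authors>Ghose, A.;  Ipeirotis, P.G.</authors></document>
  <document><rank>2</rank><authors>Stede, M.</authors></document>
</root>'
sample_doc <- xmlParse(sample_xml, asText = TRUE)

# One row per <document>; one column per child element name.
sample_df <- xmlToDataFrame(nodes = getNodeSet(sample_doc, "//document"),
                            stringsAsFactors = FALSE)

# Logical flag marking the rows whose title is missing.
missing_title <- is.na(sample_df$title)

print(sample_df[, c("title", "authors")])
print(sample_df$authors[missing_title])
```

Because titles and authors sit in the same row, the pairing between them is kept even when one of the two is missing.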

Answer 1: (score: 0)

If all you want is a list of the authors of documents that have no title, you can do this:

xpathSApply(abstract,"//document[not(title)]/authors", xmlValue)
#  [1] "Armstrong, R.;  Baillie, C.;  Cumming-Potvin, W."         "Stede, M."                                               
#  [3] "Government Documents"                                     "Piotrowski, M."                                          
# ...
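
If the goal is instead the full title-to-authors mapping with NA standing in for a missing title, as the question asks, one possible sketch is to visit each document node and look up its children individually. The helper `child_value`, the `sample_xml` string, and the title "Paper A" are hypothetical names and data introduced for this example; this is one way to do it under those assumptions, not the only one.

```r
library(XML)

# Hypothetical sample: the second <document> has no <title>.
sample_xml <- '<root>
  <document><rank>1</rank><title><![CDATA[Paper A]]></title><authors><![CDATA[Ghose, A.;  Ipeirotis, P.G.]]></authors></document>
  <document><rank>2</rank><authors><![CDATA[Stede, M.]]></authors></document>
</root>'
sample_doc <- xmlParse(sample_xml, asText = TRUE)

# Text of the first child element called `name`, or NA if there is none.
# The XPath is relative ("./name"), so it only looks inside `node`.
child_value <- function(node, name) {
  hits <- xpathSApply(node, paste0("./", name), xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Evaluate both lookups per <document> so titles and authors stay aligned.
doc_nodes <- getNodeSet(sample_doc, "//document")
title_authors <- data.frame(
  title   = vapply(doc_nodes, child_value, character(1), name = "title"),
  authors = vapply(doc_nodes, child_value, character(1), name = "authors"),
  stringsAsFactors = FALSE
)

print(title_authors)
```

Looking up both fields from the same document node avoids the misalignment that occurs when "//title" and "//authors" are queried separately and return vectors of different lengths.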