Question

Using the XML package I can scrape the URLs I need, but when I use xpathSApply, R returns unwanted \n and \t characters (newlines and tabs). Here is an example:

doc <- htmlTreeParse("http://www.milesstockbridge.com/offices/", useInternal = TRUE) # scrape and parse an HTML site
xpathSApply(doc, "//div[@class='info']//h3", xmlValue) 
[1] "\n\t\t\t\t\t\tBaltimore\t\t\t\t\t"     "\n\t\t\t\t\t\tCambridge\t\t\t\t\t"     "\n\t\t\t\t\t\tEaston\t\t\t\t\t"        "\n\t\t\t\t\t\tFrederick\t\t\t\t\t"    
[5] "\n\t\t\t\t\t\tRockville\t\t\t\t\t"     "\n\t\t\t\t\t\tTowson\t\t\t\t\t"        "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t" "\n\t\t\t\t\t\tWashington\t\t\t\t\t"

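As explained in the question "how to delete the \n\t\t\t in the result from website data collection?", regex functions can easily remove the unwanted formatting characters, but I would rather have XPath do the job in the first place, if possible (I have hundreds of pages to parse).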
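For reference, the regex-style cleanup mentioned above might look roughly like the sketch below. It uses only base R; the `raw` vector is copied from the output shown earlier, and the variable names are illustrative assumptions rather than anything from the linked question.

```r
# Sample values copied from the xpathSApply output above
raw <- c("\n\t\t\t\t\t\tBaltimore\t\t\t\t\t",
         "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t")

# Option 1: strip leading and trailing whitespace only (base R >= 3.2.0)
trimmed <- trimws(raw)

# Option 2: delete every newline and tab with a regular expression
stripped <- gsub("[\n\t]", "", raw)

print(trimmed)
print(stripped)
```

Both options leave the single space inside "Tysons Corner" intact, because only newlines, tabs, and surrounding whitespace are removed.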

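There also appear to be functions such as translate, as in the question "Using the Translate function to remove newline characters in xml, but how do I ignore certain tags?", and strip(), which I have seen in Python questions. I don't know which of these are available when using R and XPath.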

Perhaps the text() function would help, but I don't know how to include it in my xpathSApply expression. The same goes for normalize-space().
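One way normalize-space() could be applied to each matched node is sketched below. Note that it swaps in the xml2 package rather than the XML package used in this question, and the inline HTML is a hypothetical stand-in for the real page, so treat it as an illustration under those assumptions.

```r
library(xml2)

# Hypothetical stand-in for the page markup (not the real site)
html <- "<html><body><div class='info'>
<h3>\n\t\t\tBaltimore\t\t</h3>
<h3>\n\t\t\tTysons Corner\t\t</h3>
</div></body></html>"

doc <- read_html(html)

# Select the heading nodes first, then evaluate normalize-space(.)
# once per node so that every heading yields its own cleaned string
nodes <- xml_find_all(doc, "//div[@class='info']//h3")
cleaned <- xml_find_chr(nodes, "normalize-space(.)")

print(cleaned)
```

In XPath 1.0, normalize-space() applied directly to a node set converts only the first node to a string, which is why the sketch evaluates it per node. It also collapses runs of internal whitespace to a single space, which goes slightly beyond trimming.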

Answer 1

You just need the trim = TRUE argument in the call to xmlValue().

> xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE) 
#[1] "Baltimore"     "Cambridge"     "Easton"       
#[4] "Frederick"     "Rockville"     "Towson"       
#[7] "Tysons Corner" "Washington"
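Since the live page may change or disappear, here is a minimal self-contained sketch of the same call. The inline HTML is a hypothetical stand-in for the site's markup; only htmlParse(), xpathSApply(), and xmlValue() from the XML package are used.

```r
library(XML)

# Hypothetical stand-in for the page markup (not the real site)
html <- "<html><body><div class='info'>
<h3>\n\t\t\tBaltimore\t\t</h3>
<h3>\n\t\t\tTysons Corner\t\t</h3>
</div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Without trim, the surrounding newlines and tabs are kept
untrimmed <- xpathSApply(doc, "//div[@class='info']//h3", xmlValue)

# Extra arguments to xpathSApply are passed on to xmlValue,
# so trim = TRUE strips leading and trailing whitespace
trimmed <- xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE)

print(untrimmed)
print(trimmed)
```

The trimming is done in R by xmlValue() after XPath has selected the nodes, not by the XPath expression itself.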

Using R and XPath, how do I remove formatting elements such as \n and \t from the results?
