Extracting specific strings with strsplit in R

Time: 2015-12-31 11:19:56

Tags: r

I converted an object of type XMLDocument to character using the following:

do.call(paste, as.list(capture.output(list_links)))

I want to use strsplit to extract specific strings from the resulting character object. The output for list_links is shown below.

[1] "[[1]] <a href=\"/Archive/CrossNational.asp\">Cross-National Data</a>   [[2]] <a href=\"/Archive/MultiNation.asp\">Multiple Nation Surveys</a>   [[3]] <a href=\"/Archive/IntSurveys.asp\">Single Nation Surveys</a>   [[4]] <a href=\"/Archive/ChCounty.asp\">County-Level Data</a>   [[5]] <a href=\"/Archive/ChState.asp\">State-Level Data</a>   [[6]] <a href=\"/Archive/NatBaylor.asp\">Baylor Religion Surveys</a>   [[7]] <a href=\"/Archive/GSS.asp\">General Social Surveys</a>   [[8]] <a href=\"/Archive/Polls.asp\">News Polls</a>   [[9]] <a href=\"/Archive/NES.asp\">National Election Studies</a>   [[10]] <a href=\"/Archive/NatFamily.asp\">National Survey of Family Growth</a>   [[11]] <a href=\"/Archive/NSYR.asp\">National Studies of Youth and Religion (NSYR)</a>   [[12]] <a href=\"/Archive/PewResearch.asp\">Pew Research Center</a>   [[13]] <a href=\"/Archive/PALS.asp\">Portraits of American Life Study (PALS)</a>   [[14]] <a href=\"/Archive/PRRI.asp\">Public Religion Research Institute (PRRI)</a>   [[15]] <a href=\"/Archive/NatOther.asp\">Other National Surveys</a>   [[16]] <a href=\"/Archive/State1stAmnd.asp\">State of the First Amendment Surveys</a>   [[17]] <a href=\"/Archive/Middletown.asp\">Middletown Data</a>   [[18]] <a href=\"/Archive/Sfocus.asp\">Southern Focus Polls</a>   [[19]] <a href=\"/Archive/RegOther.asp\">Other Local/Regional Surveys</a>   [[20]] <a href=\"/Archive/FCT.asp\">Faith Communities Today</a>   [[21]] <a href=\"/Archive/NCS.asp\">National Congregations Study</a>   [[22]] <a href=\"/Archive/USCLS.asp\">U.S. 
Congregational Life Survey</a>   [[23]] <a href=\"/Archive/CongOther.asp\">Other Surveys</a>   [[24]] <a href=\"/Archive/Adventist.asp\">Adventist</a>   [[25]] <a href=\"/Archive/Baptist.asp\">Baptist</a>   [[26]] <a href=\"/Archive/Catholic.asp\">Catholic</a>   [[27]] <a href=\"/Archive/Jewish.asp\">Jewish</a>   [[28]] <a href=\"/Archive/Lutheran.asp\">Lutheran</a>   [[29]] <a href=\"/Archive/Methodist.asp\">Methodist</a>   [[30]] <a href=\"/Archive/Mormon.asp\">Mormon</a>   [[31]] <a href=\"/Archive/Nazarene.asp\">Nazarene</a>   [[32]] <a href=\"/Archive/Presbyterian.asp\">Presbyterian</a>   [[33]] <a href=\"/Archive/Unitarian.asp\">Unitarian-Universalist</a>   [[34]] <a href=\"/Archive/GrpOther.asp\">Other Groups</a>   [[35]] <a href=\"/Archive/InstructData.asp\">Instructional Data Files</a>   [[36]] <a href=\"/Archive/Other.asp\">Other Data</a>  "

I want to extract a list of the URLs in the a tags. That is, after using strsplit, the first object in my list should be "/Archive/CrossNational.asp".

1 Answer:

Answer 0 (score: 0)

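This does it with strsplit on the txt object (the character string shown in the question), although strsplit is not the function everyone would choose for this. The code splits the string at the href preamble and at the "> that follows each URL, then collects the even-numbered items. The "split" argument is a single regular expression that ORs the two delimiters together with |. See ?regex for more details on R regular expressions.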
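To see why the even-numbered items are the URLs, it may help to look at the intermediate pieces on a shorter input. The sketch below is only an illustration: `small` is a made-up two-link excerpt of the question's string, and the pattern is written with just the escapes that `]` and the R string quoting need, rather than copied from the answer.

```r
# Hypothetical two-link excerpt of the string from the question.
small <- "[[1]] <a href=\"/Archive/CrossNational.asp\">Cross-National Data</a>   [[2]] <a href=\"/Archive/MultiNation.asp\">Multiple Nation Surveys</a>  "

# Split on either delimiter: the text leading up to each URL,
# or the quote-plus-angle-bracket that follows it.
pieces <- strsplit(small, "\\]\\] <a href=\"|\">")[[1]]
print(pieces)

# A logical index is recycled along the vector, so c(FALSE, TRUE)
# keeps positions 2, 4, ... and drops positions 1, 3, 5, ...
urls <- pieces[c(FALSE, TRUE)]
print(urls)
```

The odd-numbered pieces hold the "[[n" prefixes and the link text, which is why they are the ones dropped.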

 strsplit(txt, "\\]\\] <a href\\=\\\"|\\\">")[[1]][c(FALSE,TRUE)]
#--- result ----

 [1] "/Archive/CrossNational.asp" "/Archive/MultiNation.asp"  
 [3] "/Archive/IntSurveys.asp"    "/Archive/ChCounty.asp"     
 [5] "/Archive/ChState.asp"       "/Archive/NatBaylor.asp"    
 [7] "/Archive/GSS.asp"           "/Archive/Polls.asp"        
 [9] "/Archive/NES.asp"           "/Archive/NatFamily.asp"    
[11] "/Archive/NSYR.asp"          "/Archive/PewResearch.asp"  
[13] "/Archive/PALS.asp"          "/Archive/PRRI.asp"         
[15] "/Archive/NatOther.asp"      "/Archive/State1stAmnd.asp" 
[17] "/Archive/Middletown.asp"    "/Archive/Sfocus.asp"       
[19] "/Archive/RegOther.asp"      "/Archive/FCT.asp"          
[21] "/Archive/NCS.asp"           "/Archive/USCLS.asp"        
[23] "/Archive/CongOther.asp"     "/Archive/Adventist.asp"    
[25] "/Archive/Baptist.asp"       "/Archive/Catholic.asp"     
[27] "/Archive/Jewish.asp"        "/Archive/Lutheran.asp"     
[29] "/Archive/Methodist.asp"     "/Archive/Mormon.asp"       
[31] "/Archive/Nazarene.asp"      "/Archive/Presbyterian.asp" 
[33] "/Archive/Unitarian.asp"     "/Archive/GrpOther.asp"     
[35] "/Archive/InstructData.asp"  "/Archive/Other.asp"
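Since strsplit may not be everyone's first choice here, one possible alternative is to match the href attributes directly instead of splitting around them. This is a rough sketch using base R's gregexpr and regmatches, not part of the original answer; `small` is again a made-up two-link excerpt, and the pattern assumes every URL sits inside double quotes right after href=.

```r
# Hypothetical two-link excerpt of the string from the question.
small <- "[[1]] <a href=\"/Archive/CrossNational.asp\">Cross-National Data</a>   [[2]] <a href=\"/Archive/MultiNation.asp\">Multiple Nation Surveys</a>  "

# Find every href="..." attribute in the string.
m <- gregexpr("href=\"[^\"]*\"", small)
hrefs <- regmatches(small, m)[[1]]

# Strip the leading href=" and the trailing quote, leaving only the URL.
urls <- gsub("^href=\"|\"$", "", hrefs)
print(urls)
```

This version does not rely on the [[n]] prefixes that capture.output adds, so it should also work if the links are pasted together differently.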