How to extract only the link using R

Time: 2018-05-31 18:26:49

Tags: r xml stringr xml2

<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="itemWithRetweets" link="http://twitter.com/MEDClementz/statuses/1001775473305817090" id="1001775473305817090">

How can I extract only the link and the id from the element above?

Desired output:

       link                                                         
[1] http://twitter.com/MEDClementz/statuses/1001775473305817090    
           id
[1] 1001775473305817090

2 answers:

Answer 0 (score: 2)

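It is better to use an XML parser than regular expressions. Note that the closing </item> tag is added here so that the string parses as well-formed XML:
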
library(xml2)
x <- read_xml('<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="itemWithRetweets" link="http://twitter.com/MEDClementz/statuses/1001775473305817090" id="1001775473305817090"></item>')

xml_attr(x,"link")
xml_attr(x,"id")

Result:

> xml_attr(x,"link")
[1] "http://twitter.com/MEDClementz/statuses/1001775473305817090"
> xml_attr(x,"id")
[1] "1001775473305817090"
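
If both values are needed together, as in the desired output, one possible variation is to read all attributes at once with xml_attrs() and keep the two of interest. This is only a sketch: the names wanted and result are illustrative and not part of the original answer. The id is kept as a character string because 1001775473305817090 is larger than 2^53, so converting it with as.numeric() would not preserve every digit.

```r
library(xml2)

x <- read_xml('<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="itemWithRetweets" link="http://twitter.com/MEDClementz/statuses/1001775473305817090" id="1001775473305817090"></item>')

# xml_attrs() returns every attribute of the node as a named character vector;
# indexing by name keeps only the two we care about.
wanted <- xml_attrs(x)[c("link", "id")]

# Collect them in a one-row data frame (illustrative name), keeping id as text.
result <- data.frame(
  link = wanted[["link"]],
  id = wanted[["id"]],
  stringsAsFactors = FALSE
)
result
```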

Answer 1 (score: 0)

Here is an option using the stringr package.

library(stringr)

# Create the example string
string <- '<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="itemWithRetweets" link="http://twitter.com/MEDClementz/statuses/1001775473305817090" id="1001775473305817090">'

# Split the string
string2 <- str_split(string, pattern = " ")[[1]]

# Get the link: keep the piece that starts with link=, then pull out the URL
link <- str_subset(string2, "^link=")
link2 <- str_extract(link, "http://.*[0-9]+")
link2
# [1] "http://twitter.com/MEDClementz/statuses/1001775473305817090"

# Get the id: anchor on ^id= so other pieces that merely contain "id" cannot match
id <- str_subset(string2, "^id=")
id2 <- str_extract(id, "[0-9]+")
id2
# [1] "1001775473305817090"
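
If no extra packages are available, the same idea can be sketched in base R with regexec() and regmatches(). The helper name extract_attr is hypothetical and introduced only for illustration. It assumes attribute values are double-quoted and that each attribute name is preceded by whitespace, as in the example string. An XML parser remains the more robust choice.

```r
# Create the example string
string <- '<item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="itemWithRetweets" link="http://twitter.com/MEDClementz/statuses/1001775473305817090" id="1001775473305817090">'

# Hypothetical helper: return the value of one double-quoted attribute,
# or NA if the attribute is not present.
extract_attr <- function(x, name) {
  # Require whitespace before the name so "type" does not match "xsi:type"
  pattern <- paste0('[[:space:]]', name, '="([^"]*)"')
  # regmatches() yields c(full match, captured group) for the first match
  m <- regmatches(x, regexec(pattern, x))[[1]]
  m[2]
}

link <- extract_attr(string, "link")
id <- extract_attr(string, "id")
link
id
```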