Question

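Dear Stack Overflow community,

I am trying to use stringr to extract unique numeric identifiers from a website. The page contains several unique DOIs, and each DOI is immediately followed by the word "Cite".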

[1] I fetch the page from the website:

pg <- read_html("https://search.datacite.org/works?query=Movebank&resource-type-id=dataset")

[2] I am trying to extract from the page the 26 unique strings that start with "doi".

[3] I intend to use str_match_all: the match must start with "https://doi.org/", have some characters ("*") in between, and end with the word "Cite".

str_match_all(html_text(html_nodes(pg, "body")), pattern = "^https://doi.org/*Cite$")

[4] An example of what one of these DOIs looks like:

https://doi.org/10.5441/001/1.41076dq1/6Cite

Thank you very much for your help!

Best regards,

Diego

Answer 1

Using code similar to hrbrmstr's in the answer linked here, you can easily get all the URLs you need: https://stackoverflow.com/a/46674097/10710995

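library(rvest)  # provides html_nodes() and html_attr(); pg comes from read_html() in the question

# Select every <a> element whose href contains "doi.org"
fils <- html_nodes(pg, xpath = ".//a[contains(@href, 'doi.org')]")

# Collect the href attribute of each selected link into a data frame
df <- data.frame(link = html_attr(fils, "href"))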

 df
                                          link
1  https://doi.org/10.25504/fairsharing.httzv2
2     https://doi.org/10.5441/001/1.41076dq1/6
3     https://doi.org/10.5441/001/1.q986rc29/3
4     https://doi.org/10.5441/001/1.q986rc29/4
5       https://doi.org/10.5441/001/1.25551gr6
6     https://doi.org/10.5441/001/1.25551gr6/1
7     https://doi.org/10.5441/001/1.25551gr6/2
8     https://doi.org/10.5441/001/1.q8b02dc5/4
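
If you would rather stay with the regex approach from the question, the pattern there likely fails for two reasons: "^" and "$" anchor the match to the start and end of the whole body text, and "*" repeats the preceding character (here "/") rather than meaning "any characters". Below is a minimal sketch of an alternative, assuming every DOI in the page text is directly followed by "Cite" as in the example. The page_text sample is hypothetical and stands in for html_text(html_nodes(pg, "body")); a DOI that itself contains "Cite" would be cut short by this pattern.

```r
library(stringr)

# Hypothetical sample standing in for html_text(html_nodes(pg, "body"))
page_text <- paste(
  "Some title https://doi.org/10.5441/001/1.41076dq1/6Cite",
  "Another title https://doi.org/10.5441/001/1.q986rc29/3Cite",
  "Repeated entry https://doi.org/10.5441/001/1.41076dq1/6Cite",
  sep = "\n"
)

# No ^ or $ anchors, because the DOIs sit in the middle of the text.
# \\S+? lazily matches non-whitespace characters, and the lookahead
# (?=Cite) requires "Cite" to follow without including it in the match.
doi_pattern <- "https://doi\\.org/\\S+?(?=Cite)"

# str_extract_all() returns a list with one element per input string;
# unique() drops DOIs that appear more than once.
dois <- unique(str_extract_all(page_text, doi_pattern)[[1]])
print(dois)
```

str_extract_all is used here instead of str_match_all because the pattern has no capture groups, so only the full match is needed.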
