Extracting a string from a URL in R

Date: 2019-05-09 18:53:38

Tags: r string text-extraction

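My data frame contains a URL field that sometimes includes a 13-digit product identifier. I need to extract this product ID and write it to a new column called ISBN. Here are 3 different URLs, each with a different product ID: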

>https://catalog.macmillan.com/childrens/book/brazen/rebel-ladies-who-rocked-the-world/pnlope-bagieu/**9781626728691**?utm_source=exacttarget&utm_medium=newsletter&utm_term=na-schoolandlibrary&utm_content=na-discover-nl&utm_campaign=schoolandlibrary
>https://us.macmillan.com/excerpt?isbn=**9781250151025**&utm_source=exacttarget&utm_medium=newsletter&utm_term=na-schoolandlibrary&utm_content=na-discover-nl&utm_campaign=schoolandlibrary
>https://catalog.macmillan.com/childrens/book/so-tall-within/sojourner-truths-long-walk-toward-freedom/gary-d-schmidt/daniel-minter/**9781626728721**?utm_source=exacttarget&utm_medium=newsletter&utm_term=na-schoolandlibrary&utm_content=na-discover-nl&utm_campaign=schoolandlibrary

1 Answer:

Answer 0: (score: 1)

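Use gregexpr together with regmatches, assuming the product number is always exactly 13 digits long, as in the examples shown. gregexpr finds every run of 13 digits in the string, and regmatches returns the matched substrings.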

regmatches(tx, gregexpr("(\\d{13})", tx))
# [[1]]
# [1] "9781626728691" "9781250151025" "9781626728721"

Data

tx <- "https://catalog.macmillan.com/childrens/book/brazen/rebel-ladies-who-rocked-the-world/pnlope-bagieu/9781626728691?utm_source=exacttarget&utm_medium=newsletter&utm_term=na-schoolandlibrary&utm_content=na-discover-nl&utm_campaign=schoolandlibrary https://us.macmillan.com/excerpt?isbn=9781250151025&utm_source=exacttarget&utm_medium=newsletter&utm_term=na-schoolandlibrary&utm_content=na-discover-nl&utm_campaign=schoolandlibrary https://catalog.macmillan.com/childrens/book/so-tall-within/sojourner-truths-long-walk-toward-freedom/gary-d-schmidt/daniel-minter/9781626728721?utm_source=exacttarget&utm_medium=newsletter&utm_term=na-schoolandlibrary&utm_content=na-discover-nl&utm_campaign=schoolandlibrary"
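
The question asks for a new ISBN column in a data frame where only some rows contain an ID, whereas the snippet above works on a single string. A minimal sketch of the per-row version follows, using base R's regexpr (first match per element) instead of gregexpr. The data frame `df`, its column name `url`, the shortened query strings, and the third URL without an ID are all assumptions made up for illustration.

```r
# Hypothetical data frame; `df` and the column name `url` are assumptions.
# The first two URLs are shortened versions of the ones in the question;
# the third is an invented example of a row that has no product ID.
df <- data.frame(
  url = c(
    "https://catalog.macmillan.com/childrens/book/brazen/rebel-ladies-who-rocked-the-world/pnlope-bagieu/9781626728691?utm_source=exacttarget",
    "https://us.macmillan.com/excerpt?isbn=9781250151025&utm_source=exacttarget",
    "https://us.macmillan.com/about"
  ),
  stringsAsFactors = FALSE
)

# regexpr() returns the position of the first match in each element,
# or -1 where the element has no run of 13 digits.
m <- regexpr("\\d{13}", df$url)

# Start with NA everywhere, then fill in only the rows that matched.
# With regexpr() output, regmatches() returns one string per matching
# element, so its length equals sum(m != -1).
df$ISBN <- NA_character_
df$ISBN[m != -1] <- regmatches(df$url, m)

df$ISBN
# [1] "9781626728691" "9781250151025" NA
```

Like the original pattern, this matches the first 13 digits of any longer digit run. If that matters for your data, one option is to pass perl = TRUE and use the pattern "(?<!\\d)\\d{13}(?!\\d)", which only matches runs of exactly 13 digits.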