Suppose I have the following links:
<a class=\"MainCategory\"href=\"/cp/3951?povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers\">Computers</a>
<a href=\"/browse/electronics/desktop-computers/3944_3951_132982/?_refineresult=true&catNavId=3951&povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers_Desktops\">Desktops</a>
<a href=\"/cp/Laptops/1089430?povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers_Laptops\">Laptops</a>
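Is there a way to automatically extract the following IDs: 3951, 132982 and 1089430, along with their corresponding labels: Computers, Desktops and Laptops?
Answer 0 (score: 0)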
Suppose your links are stored in a character vector like this:
vec <- c("<a class=\"MainCategory\"href=\"/cp/3951?povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers\">Computers</a>",
"<a href=\"/browse/electronics/desktop-computers/3944_3951_132982/?_refineresult=true&catNavId=3951&povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers_Desktops\">Desktops</a>",
"<a href=\"/cp/Laptops/1089430?povid=cat1070145-env172199-moduleA080112-lLinkGNAV_Electronics_Computers_Laptops\">Laptops</a>")
You can then use regular expressions to extract the information:
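# Inner sub(): keep the digits directly before "?povid" (links 1 and 3);
# link 2 has no such match, so it passes through unchanged.
# Outer sub(): for the "3944_3951_132982" form, keep the last number (link 2).
data.frame(ID = sub(".*[0-9]+_[0-9]+_([0-9]+).*", "\\1",
                    sub(".*[^0-9]([0-9]+)\\?povid.*", "\\1", vec)),
           Label = sub(".*>(.*)</a>$", "\\1", vec))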
#        ID     Label
# 1    3951 Computers
# 2  132982  Desktops
# 3 1089430   Laptops
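
The nested sub() calls above rely on the exact shapes of these three links. As a hedged alternative, here is a minimal sketch that splits the work into named steps: isolate the href value, drop the query string, then take the last run of digits in the path. It assumes every link has one double-quoted href and that the wanted ID is always the final number in the path; the helper name extract_links is made up for illustration and is not part of the original answer.

```r
# Hypothetical helper (not from the original answer): one small step per line.
extract_links <- function(links) {
  # 1. Isolate the value of the href attribute.
  href <- sub('.*href="([^"]*)".*', "\\1", links)
  # 2. Drop the query string (everything from the first "?").
  path <- sub("\\?.*$", "", href)
  # 3. Assumption: the ID is the last run of digits in the path,
  #    optionally followed by a trailing slash.
  id <- sub(".*[^0-9]([0-9]+)/?$", "\\1", path)
  # 4. The label is the text between the opening tag and </a>.
  label <- sub(".*>([^<]*)</a>$", "\\1", links)
  data.frame(ID = id, Label = label, stringsAsFactors = FALSE)
}

# Shortened versions of the three links from the question.
example_links <- c(
  "<a class=\"MainCategory\"href=\"/cp/3951?povid=cat1070145\">Computers</a>",
  "<a href=\"/browse/electronics/desktop-computers/3944_3951_132982/?_refineresult=true&catNavId=3951&povid=cat1070145\">Desktops</a>",
  "<a href=\"/cp/Laptops/1089430?povid=cat1070145\">Laptops</a>"
)
result <- extract_links(example_links)
print(result)
```

Calling extract_links(vec) on the full vector from the answer should give the same three IDs and labels. For anything beyond a handful of uniform links, a real HTML parser is likely more robust than regular expressions.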