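Suppose I have many URL strings and I want to extract the "meaningful" information from them. That is, I want to know which page each URL points to. For example, for the site subaru.com, I want to know whether the URL is for the about page, the special-deals page, and so on.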
[1] "http://www.subaru.com/vehicles/impreza/index.html"
[2] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602"
[3] "http://www.subaru.com/index.html?s_kwcid=subaru models&k_clickid=3ec14630-aa7f-b968-c389-00003e9a93f9&prid=87&k_affcode=77236"
[4] "http://www.subaru.com/customer-support.html"
[5] "http://www.subaru.com/"
[6] "http://www.subaru.com/vehicles/forester/index.html"
[7] "http://www.subaru.com/auto-show/detroit-2014.html"
[8] "http://www.subaruofchampaigncounty.com/index.htm"
[9] "http://www.subaru.com/build-your-own/impreza.html?zip=92106"
[10] "http://www.subaru.com/mobile/index.html"
[11] "http://www.subaru.com/"
[12] "http://www.subaru.com/"
[13] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=2361a001-195a-29c8-7323-00003c593714&prid=87&k_affcode=76602"
[14] "http://www.subaru.ru/index"
[15] "http://www.subarugeorgetown.com/certified/subaru/2013-subaru-outback-georgetown-tx-1b523a570a0a00de63937097e2f3723d.htm"
[16] "http://www.subaru.com/"
[17] "http://www.subaru.com/?s_kwcid=suburau&k_clickid=41a2c6dc-c9fa-6ac8-9bf0-000044fe28d7&prid=87&k_affcode=2966&gclid=cprrlygp-rscfugs7aodbkiaaw"
[18] "http://www.subaru.com/mobile/index.html"
[19] "http://www.subaru.com/mobile/index.html"
[20] "http://www.subaru.com/enthusiasts/index.html"
[21] "http://www.subaru.ru/index"
[22] "http://www.subaru.ru/index"
[23] "http://www.subaru.com/mobile/index.html"
[24] "http://www.subaru.com/"
[25] "http://www.subaru.com/"
[26] "http://www.subaru.com/"
[27] "http://www.subaru.com/enthusiasts/index.html"
[28] "http://www.subaruofdayton.com/tcd/home/?tcdkwid=22194961&tcdcmpid=19148&tcdadid=6852747105&locale=en_us"
[29] "http://www.subaru.com/build-your-own/outback.html?sc_brochure=subaru.outback.2014-specifications"
[30] "http://www.subaruofatlanta.com/featured-vehicles/used.htm?reset=inventorylisting"
[31] "http://www.subaru.com/customer-support.html"
[32] "http://www.subarupacific.com/index.htm?cikw=+subaru&cimt=b&cipl=&cinetwork=search&ciagaid=49620691888&gclid=clhf0uoq-rscffpm7aodtv0aiw"
[33] "http://www.subaru.ru/index"
[34] "http://www.subaru.ru/lineup/forester/spec/spec"
[35] "http://www.subaru.com/build-your-own/forester.html?zip=37211"
[36] "http://www.subaru.com/mobile/index.html"
[37] "http://www.subaruelcajon.com/index.htm"
[38] "http://www.subaru.com/customer-support.html"
[39] "http://www.subaru.com/vehicles/brz/index.html?s_kwcid=brz&k_clickid=1ec224f1-18c6-a228-5afb-000047ecef67&prid=87&k_affcode=197257&gclid=cpik35-r-rscfrsffgodhk4ajg"
[40] "http://www.subaru.com/mobile/index.html"
[41] "http://www.subaru.com/mobile/index.html"
[42] "http://www.subaru.ru/index"
[43] "http://www.subaru.com/"
[44] "http://www.subaru.com/vehicles/xv-crosstrek/index.html"
[45] "http://www.subaru.com/customer-support.html"
[46] "http://www.subaru.com/mobile/index.html"
[47] "http://www.subaru.ru/index"
[48] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=293d9ff9-a1ad-8489-82d3-00001e3a514f&prid=87&k_affcode=76602"
[49] "http://www.subaruofkingsautomall.com/index.htm"
[50] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=5ed77da1-f786-55e9-02d1-000055d135fc&prid=87&k_affcode=76602"
[51] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=1645e9d9-05b5-1fe8-d2b1-00002a3ce9e8&prid=87&k_affcode=76602"
[52] "https://www.subaru.com/my-subaru/account.html"
[53] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=0c0e3142-706d-4cc8-830f-00001ba63c96&prid=87&k_affcode=76602"
[54] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=3a594c6a-4485-d2c9-aabf-000051bdfc1d&prid=87&k_affcode=76602"
[55] "http://www.subaru.com/"
[56] "http://www.subaru.com/customer-support.html"
[57] "http://www.subaru.com/build-your-own/index.html"
[58] "http://www.subaru.com/"
[59] "http://www.subaru.com/mobile/index.html"
[60] "http://www.subaru.com/vehicles/brz/photos-videos.html?site=370595&placement=96106620&ad=7514606&creative=0"
[61] "http://www.subaru.com/customer-support.html"
[62] "http://www.subaru.com/"
[63] "http://www.subaru.com/"
[64] "http://www.subaru.com/customer-support.html"
[65] "http://www.subaru.com/mobile/index.html"
[66] "http://www.subaru.com/mobile/index.html"
[67] "http://www.subaru.com/"
[68] "http://www.subaru.com/mobile/index.html"
[69] "http://www.subaru.com/build-your-own/impreza.html?zip=01504"
[70] "http://www.subaru.com/enthusiasts/badge-of-ownership/index.html"
[71] "http://www.subaru.com/"
[72] "http://www.subaru.com/mobile/index.html"
[73] "http://www.subaruofcolumbia.com/used-inventory/index.htm"
[74] "http://www.subaru.com/customer-support.html"
[75] "http://www.subaru.com/"
[76] "http://www.subaruofpuyallup.com/tcd/home/?tcdkwid=22163386&tcdcmpid=13971&tcdadid=35753423988&locale=en_us"
[77] "http://www.subaru.com/mobile/vehicles/forester/index.html"
[78] "http://www.subaru.com/mobile/index.html"
[79] "http://www.subaru.com/"
[80] "http://www.subaru.com/"
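As you can see, there is no single rule I can use to extract one thing from the URL strings, because each one is different. Also note that some of the domains end in .ru rather than .com. So far I have put together the following code, but I still want to extract the page (xv-crosstrek, customer-support, etc.).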
mydat$URL_One <- gsub(".*www\\.([[:alpha:]]+\\.com).*","\\1", mydat$URL)
mydat$URL_Two <- gsub(".*\\.com","", mydat$URL)
Can anyone help with this task?
I think I may want to remove the /index occurrences from each URL string.
So, for example:
before:
"http://www.subaru.com/vehicles/forester/index.html"
after:
forester
before:
http://www.subaruofcolumbia.com/used-inventory/index.htm
after:
used-inventory
before:
http://www.subaru.com/build-your-own/forester.html?zip=37211
after:
build-your-own
Answer 0 (score: 0)
The httr package has a function, parse_url, that may help. For example, you can do:
> library(httr)
> parse_url("http://www.subaru.com/vehicles/forester/index.html")
$scheme
[1] "http"
$hostname
[1] "www.subaru.com"
$port
NULL
$path
[1] "vehicles/forester/index.html"
$query
NULL
$params
NULL
$fragment
NULL
$username
NULL
$password
NULL
attr(,"class")
[1] "url"
This might get you part of the way there.
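To go from the parsed path to a single label, one option is to split `$path` on `/` and pick a segment. The sketch below is only one possible rule, chosen because it reproduces the three before/after examples in the question: it returns the last directory in the path and falls back to the file name when there is no directory. `page_label` is a hypothetical helper, not part of httr, and the rule is an assumption about what counts as "meaningful".

```r
library(httr)

# Hypothetical helper (not part of httr): label a URL by its last directory,
# falling back to the file name when the path has no directory.
page_label <- function(url) {
  path <- parse_url(url)$path
  if (is.null(path) || !nzchar(path)) return(NA_character_)  # bare domain
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  if (length(parts) == 0) return(NA_character_)
  # A trailing slash means every segment is a directory; otherwise the last
  # segment is a file name.
  dirs <- if (grepl("/$", path)) parts else parts[-length(parts)]
  if (length(dirs) > 0) return(dirs[length(dirs)])
  file <- sub("\\.[[:alnum:]]+$", "", parts[1])  # strip the extension
  if (file == "index") NA_character_ else file   # index pages carry no label
}

urls <- c(
  "http://www.subaru.com/vehicles/forester/index.html",
  "http://www.subaruofcolumbia.com/used-inventory/index.htm",
  "http://www.subaru.com/build-your-own/forester.html?zip=37211",
  "http://www.subaru.com/customer-support.html",
  "http://www.subaru.com/"
)
# parse_url() takes one URL at a time, so apply the helper element-wise.
labels <- vapply(urls, page_label, character(1), USE.NAMES = FALSE)
print(labels)
```

Home pages come back as `NA`, which keeps them easy to filter or relabel afterwards. Under this rule a URL such as `auto-show/detroit-2014.html` is labelled `auto-show`, so adjust the segment choice if the file name is the part you care about.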
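Answer 1 (score: 0)
R has two handy functions for this. basename returns the base name of a URL, while dirname returns the directory name (or path). Taking urls to be your first ten URLs, I think we can get the result you are looking for as follows.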
> urls
# [1] "http://www.subaru.com/vehicles/impreza/index.html"
# [2] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602"
# [3] "http://www.subaru.com/index.html?s_kwcid=subaru models&k_clickid=3ec14630-aa7f-b968-c389-00003e9a93f9&prid=87&k_affcode=77236"
# [4] "http://www.subaru.com/customer-support.html"
# [5] "http://www.subaru.com/"
# [6] "http://www.subaru.com/vehicles/forester/index.html"
# [7] "http://www.subaru.com/auto-show/detroit-2014.html"
# [8] "http://www.subaruofchampaigncounty.com/index.htm"
# [9] "http://www.subaru.com/build-your-own/impreza.html?zip=92106"
# [10] "http://www.subaru.com/mobile/index.html"
> ifelse(grepl('index|zip', basename(urls)),
gsub('^.*/', '', dirname(urls)),
gsub('\\.html', '', basename(urls)))
# [1] "impreza" "www.subaru.com"
# [3] "www.subaru.com" "customer-support"
# [5] "www.subaru.com" "forester"
# [7] "detroit-2014" "www.subaruofchampaigncounty.com"
# [9] "build-your-own" "mobile"
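The `index|zip` test above depends on the query strings that happen to appear in this sample, and bare home pages come back as the host name. A variation, sketched here under the assumption that the label should be the last path component that is not an index page, strips the query string, scheme and host before calling `basename`. `last_component` is a hypothetical name, and this is not a drop-in replacement: for the build-your-own URLs it returns the model name (for example `impreza`) rather than `build-your-own`.

```r
# Hypothetical helper built on basename(); the name is illustrative.
last_component <- function(urls) {
  path <- sub("[?#].*$", "", urls)                # drop query string / fragment
  path <- sub("^[[:alpha:]]+://[^/]+", "", path)  # drop scheme and host
  path <- sub("/index(\\.html?)?$", "", path)     # drop a trailing index page
  path <- sub("/$", "", path)                     # drop a trailing slash
  page <- sub("\\.html?$", "", basename(path))    # last component, no extension
  page[!nzchar(path)] <- NA_character_            # nothing left: home page
  page
}

sample_urls <- c(
  "http://www.subaru.com/vehicles/impreza/index.html",
  "http://www.subaru.com/index.html?s_kwcid=subaru&prid=87",
  "http://www.subaru.com/customer-support.html",
  "http://www.subaru.com/",
  "http://www.subaru.com/auto-show/detroit-2014.html",
  "http://www.subaru.ru/index",
  "http://www.subaruofdayton.com/tcd/home/?locale=en_us"
)
print(last_component(sample_urls))
```

Because every step is a vectorised `sub` call, the helper can be applied directly to a whole column, for example `mydat$URL`, and it treats `.ru` and `.com` hosts the same way.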