Extracting information from URL paths

Date: 2014-03-17 17:25:19

Tags: string r

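Suppose I have a lot of URL strings and I want to extract the "meaningful" information from them. That is, I want to know which page each URL points to. So, if the site is subaru.com, I want to know whether the URL is for the About page, the special deals page, and so on.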

 [1] "http://www.subaru.com/vehicles/impreza/index.html"                                                                                                                 
 [2] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602"                                            
 [3] "http://www.subaru.com/index.html?s_kwcid=subaru models&k_clickid=3ec14630-aa7f-b968-c389-00003e9a93f9&prid=87&k_affcode=77236"                                     
 [4] "http://www.subaru.com/customer-support.html"                                                                                                                       
 [5] "http://www.subaru.com/"                                                                                                                                            
 [6] "http://www.subaru.com/vehicles/forester/index.html"                                                                                                                
 [7] "http://www.subaru.com/auto-show/detroit-2014.html"                                                                                                                 
 [8] "http://www.subaruofchampaigncounty.com/index.htm"                                                                                                                  
 [9] "http://www.subaru.com/build-your-own/impreza.html?zip=92106"                                                                                                       
[10] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[11] "http://www.subaru.com/"                                                                                                                                            
[12] "http://www.subaru.com/"                                                                                                                                            
[13] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=2361a001-195a-29c8-7323-00003c593714&prid=87&k_affcode=76602"                                            
[14] "http://www.subaru.ru/index"                                                                                                                                        
[15] "http://www.subarugeorgetown.com/certified/subaru/2013-subaru-outback-georgetown-tx-1b523a570a0a00de63937097e2f3723d.htm"                                           
[16] "http://www.subaru.com/"                                                                                                                                            
[17] "http://www.subaru.com/?s_kwcid=suburau&k_clickid=41a2c6dc-c9fa-6ac8-9bf0-000044fe28d7&prid=87&k_affcode=2966&gclid=cprrlygp-rscfugs7aodbkiaaw"                     
[18] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[19] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[20] "http://www.subaru.com/enthusiasts/index.html"                                                                                                                      
[21] "http://www.subaru.ru/index"                                                                                                                                        
[22] "http://www.subaru.ru/index"                                                                                                                                        
[23] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[24] "http://www.subaru.com/"                                                                                                                                            
[25] "http://www.subaru.com/"                                                                                                                                            
[26] "http://www.subaru.com/"                                                                                                                                            
[27] "http://www.subaru.com/enthusiasts/index.html"                                                                                                                      
[28] "http://www.subaruofdayton.com/tcd/home/?tcdkwid=22194961&tcdcmpid=19148&tcdadid=6852747105&locale=en_us"                                                           
[29] "http://www.subaru.com/build-your-own/outback.html?sc_brochure=subaru.outback.2014-specifications"                                                                  
[30] "http://www.subaruofatlanta.com/featured-vehicles/used.htm?reset=inventorylisting"                                                                                  
[31] "http://www.subaru.com/customer-support.html"                                                                                                                       
[32] "http://www.subarupacific.com/index.htm?cikw=+subaru&cimt=b&cipl=&cinetwork=search&ciagaid=49620691888&gclid=clhf0uoq-rscffpm7aodtv0aiw"                            
[33] "http://www.subaru.ru/index"                                                                                                                                        
[34] "http://www.subaru.ru/lineup/forester/spec/spec"                                                                                                                    
[35] "http://www.subaru.com/build-your-own/forester.html?zip=37211"                                                                                                      
[36] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[37] "http://www.subaruelcajon.com/index.htm"                                                                                                                            
[38] "http://www.subaru.com/customer-support.html"                                                                                                                       
[39] "http://www.subaru.com/vehicles/brz/index.html?s_kwcid=brz&k_clickid=1ec224f1-18c6-a228-5afb-000047ecef67&prid=87&k_affcode=197257&gclid=cpik35-r-rscfrsffgodhk4ajg"
[40] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[41] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[42] "http://www.subaru.ru/index"                                                                                                                                        
[43] "http://www.subaru.com/"                                                                                                                                            
[44] "http://www.subaru.com/vehicles/xv-crosstrek/index.html"                                                                                                            
[45] "http://www.subaru.com/customer-support.html"                                                                                                                       
[46] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[47] "http://www.subaru.ru/index"                                                                                                                                        
[48] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=293d9ff9-a1ad-8489-82d3-00001e3a514f&prid=87&k_affcode=76602"                                            
[49] "http://www.subaruofkingsautomall.com/index.htm"                                                                                                                    
[50] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=5ed77da1-f786-55e9-02d1-000055d135fc&prid=87&k_affcode=76602"                                            
[51] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=1645e9d9-05b5-1fe8-d2b1-00002a3ce9e8&prid=87&k_affcode=76602"                                            
[52] "https://www.subaru.com/my-subaru/account.html"                                                                                                                     
[53] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=0c0e3142-706d-4cc8-830f-00001ba63c96&prid=87&k_affcode=76602"                                            
[54] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=3a594c6a-4485-d2c9-aabf-000051bdfc1d&prid=87&k_affcode=76602"                                            
[55] "http://www.subaru.com/"                                                                                                                                            
[56] "http://www.subaru.com/customer-support.html"                                                                                                                       
[57] "http://www.subaru.com/build-your-own/index.html"                                                                                                                   
[58] "http://www.subaru.com/"                                                                                                                                            
[59] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[60] "http://www.subaru.com/vehicles/brz/photos-videos.html?site=370595&placement=96106620&ad=7514606&creative=0"                                                        
[61] "http://www.subaru.com/customer-support.html"                                                                                                                       
[62] "http://www.subaru.com/"                                                                                                                                            
[63] "http://www.subaru.com/"                                                                                                                                            
[64] "http://www.subaru.com/customer-support.html"                                                                                                                       
[65] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[66] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[67] "http://www.subaru.com/"                                                                                                                                            
[68] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[69] "http://www.subaru.com/build-your-own/impreza.html?zip=01504"                                                                                                       
[70] "http://www.subaru.com/enthusiasts/badge-of-ownership/index.html"                                                                                                   
[71] "http://www.subaru.com/"                                                                                                                                            
[72] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[73] "http://www.subaruofcolumbia.com/used-inventory/index.htm"                                                                                                          
[74] "http://www.subaru.com/customer-support.html"                                                                                                                       
[75] "http://www.subaru.com/"                                                                                                                                            
[76] "http://www.subaruofpuyallup.com/tcd/home/?tcdkwid=22163386&tcdcmpid=13971&tcdadid=35753423988&locale=en_us"                                                        
[77] "http://www.subaru.com/mobile/vehicles/forester/index.html"                                                                                                         
[78] "http://www.subaru.com/mobile/index.html"                                                                                                                           
[79] "http://www.subaru.com/"                                                                                                                                            
[80] "http://www.subaru.com/"   

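As you can see, there is no single rule I can use to extract one thing from the URL strings, because each one is different. Also, note that some have the extension .ru instead of .com. So far I have put together the following code, but I still want to extract the page (xv-crosstrek, customer-support, etc.).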

mydat$URL_One <- gsub(".*www\\.([[:alpha:]]+\\.com).*","\\1", mydat$URL)
mydat$URL_Two <- gsub(".*\\.com","", mydat$URL)   

Can anyone help with this task?

I think I may want to remove the /index occurrences from each URL string.

So, for example:

before:
"http://www.subaru.com/vehicles/forester/index.html"   
after:
forester

before:
http://www.subaruofcolumbia.com/used-inventory/index.htm
after:
used-inventory

before:
http://www.subaru.com/build-your-own/forester.html?zip=37211
after:
build-your-own

2 answers:

Answer 0 (score: 0)

The httr package has a function called parse_url. For example, you can do

> library(httr)
> parse_url("http://www.subaru.com/vehicles/forester/index.html")

$scheme
[1] "http"

$hostname
[1] "www.subaru.com"

$port
NULL

$path
[1] "vehicles/forester/index.html"

$query
NULL

$params
NULL

$fragment
NULL

$username
NULL    

$password
NULL

attr(,"class")
[1] "url"

That might get you part of the way there.
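To go from the `$path` element shown above to a page name, one option is to split the path on `/` and drop the file extension and any `index` entry. The following is a minimal base R sketch under that assumption; `path_segments` is a hypothetical helper name, not part of httr, and it expects a single path string of the kind `parse_url` returned in the output above.

```r
# Hypothetical helper (not part of httr): split a URL path such as
# "vehicles/forester/index.html" into its meaningful segments.
path_segments <- function(path) {
  # A home page has no path at all (NULL or an empty string)
  if (is.null(path) || !nzchar(path)) return(character(0))
  segs <- strsplit(path, "/", fixed = TRUE)[[1]]
  segs <- segs[nzchar(segs)]            # drop empty pieces left by a leading or trailing "/"
  segs <- sub("\\.html?$", "", segs)    # strip a trailing .htm or .html
  segs[segs != "index"]                 # "index" pages carry no information of their own
}

path_segments("vehicles/forester/index.html")
# [1] "vehicles" "forester"

path_segments("build-your-own/forester.html")
# [1] "build-your-own" "forester"
```

Taking the last element gives `forester` in both cases, while taking the first gives the site section (`vehicles`, `build-your-own`). The examples in the question want the last segment for one URL and the first for another, so which element to keep is a choice that depends on how the pages should be grouped.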

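Answer 1 (score: 0)

R has two handy functions for this. basename returns the base name of a URL, while dirname returns the directory name (or path). Taking urls to be your first ten URLs, I think we can get the result you are looking for as follows.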

> urls
# [1] "http://www.subaru.com/vehicles/impreza/index.html"                                                                            
# [2] "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602"       
# [3] "http://www.subaru.com/index.html?s_kwcid=subaru models&k_clickid=3ec14630-aa7f-b968-c389-00003e9a93f9&prid=87&k_affcode=77236"
# [4] "http://www.subaru.com/customer-support.html"                                                                                  
# [5] "http://www.subaru.com/"                                                                                                       
# [6] "http://www.subaru.com/vehicles/forester/index.html"                                                                           
# [7] "http://www.subaru.com/auto-show/detroit-2014.html"                                                                            
# [8] "http://www.subaruofchampaigncounty.com/index.htm"                                                                             
# [9] "http://www.subaru.com/build-your-own/impreza.html?zip=92106"                                                                  
# [10] "http://www.subaru.com/mobile/index.html"                                                                                      

> ifelse(grepl('index|zip', basename(urls)),
         gsub('^.*/', '', dirname(urls)), 
         gsub('\\.html', '', basename(urls)))
# [1] "impreza"                         "www.subaru.com"                 
# [3] "www.subaru.com"                  "customer-support"               
# [5] "www.subaru.com"                  "forester"                       
# [7] "detroit-2014"                    "www.subaruofchampaigncounty.com"
# [9] "build-your-own"                  "mobile"
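
In the output above, home pages come back as the host name, and query strings are only handled when they contain `zip`. A possible variation, sketched here in base R, strips the query string and the host first and returns `NA` for home pages. `extract_page` is a hypothetical name, and the regular expressions assume URLs shaped like the ones in the question.

```r
# Hypothetical helper: last meaningful path segment of each URL, or NA for a home page.
extract_page <- function(urls) {
  no_query <- sub("[?#].*$", "", urls)                 # drop query string and fragment
  path <- sub("^[a-z]+://[^/]+/?", "", no_query)       # drop scheme (http, https) and host
  path <- sub("(^|/)index(\\.html?)?$", "", path)      # drop a trailing index, index.htm or index.html
  path <- sub("\\.html?$", "", path)                   # drop a remaining .htm or .html extension
  path <- sub("/$", "", path)                          # drop a trailing slash
  page <- basename(path)
  page[!nzchar(path)] <- NA_character_                 # nothing left: the URL was a home page
  page
}

extract_page(c("http://www.subaru.com/vehicles/impreza/index.html",
               "http://www.subaru.com/",
               "http://www.subaru.com/customer-support.html",
               "http://www.subaru.ru/index"))
# [1] "impreza"          NA                 "customer-support" NA
```

Unlike the `ifelse` version, this returns `impreza` rather than `build-your-own` for `build-your-own/impreza.html?zip=92106`; if the section name is wanted there instead, `basename(dirname(path))` could be used for those rows.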