R: extract part of a string

Time: 2016-02-16 05:03:46

Tags: r

I need to extract part of a string from an address variable. My data looks like this:

"                                                                      
[45] "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "                                   
[46] "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600, Sri Lanka "                            
[47] "10 Church Street | inside the Fort, Galle, Sri Lanka "                                                    
[48] "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615, Sri Lanka "                                 
[49] "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka "                                                    
[50] "Matara Road, Koggala, Galle, Sri Lanka "  

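I want to extract the city from each string, which in this case should be "Galle". The only pattern I can think of is that it appears just before "Sri Lanka", or rather that the city sits between a "," and "Sri Lanka". This is the code I used: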

gsub("\\.s*|(, Sri Lanka).*", "", a)

But with this code I get the following result, where only the trailing ", Sri Lanka" has been removed:

[45] "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630"                                   
[46] "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600"                            
[47] "10 Church Street | inside the Fort, Galle"                                                    
[48] "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615"                                 
[49] "No: 288 Galle Road | Dadella, Galle 80000"                                                    
[50] "Matara Road, Koggala, Galle" 

Is there any way to keep only the city?

3 answers:

Answer 0 (score: 1)

n <- c(
     "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "       ,
     "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600, Sri Lanka ",
     "10 Church Street | inside the Fort, Galle, Sri Lanka "                        ,
     "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615, Sri Lanka "     ,
     "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka "                        ,
     "Matara Road, Koggala, Galle, Sri Lanka " )

First, extract the city name together with a possible state abbreviation and a possible postal code:

m <- sub('.*, (.*), Sri Lanka *$', '\\1', n)

m is now:

[1] "Galle GL 80630" "Galle 80600" "Galle" "Galle 80615" "Galle 80000" "Galle"
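The reason this pattern picks out only the last field is that the leading `.*` is greedy: it consumes as much as it can, so the capture group is left with just the text after the last comma that precedes ", Sri Lanka". A small sketch of the difference, using a made-up address `x` and a lazy variant (`.*?`, run with `perl = TRUE`) purely for comparison:

```r
# Hypothetical single address, used only for illustration
x <- "Matara Road, Koggala, Galle GL 80630, Sri Lanka "

# Greedy prefix: the group captures only the last field before ", Sri Lanka"
greedy <- sub('.*, (.*), Sri Lanka *$', '\\1', x)

# Lazy prefix: the group captures everything after the FIRST comma instead
lazy <- sub('.*?, (.*), Sri Lanka *$', '\\1', x, perl = TRUE)

print(greedy)  # "Galle GL 80630"
print(lazy)    # "Koggala, Galle GL 80630"
```

So the greedy form is the one that fits here.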

Next, strip the postal code:

l <- sub(' \\d{5} *$', '', m )

l is:

[1] "Galle GL" "Galle" "Galle" "Galle" "Galle" "Galle"

Finally, strip the two-letter state abbreviation:

sub('( \\w{2})$', '', l)
  

[1] "Galle" "Galle" "Galle" "Galle" "Galle" "Galle"
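The three steps above can be wrapped in one helper so they apply to the whole vector in a single call. This is only a sketch: `extract_city` is a hypothetical name, and it assumes every address ends in ", Sri Lanka", that postal codes are exactly five digits, and that the state code is two uppercase letters (a slightly stricter test than the `\\w{2}` used above).

```r
# Hypothetical helper combining the three sub() steps
extract_city <- function(addr) {
  # Step 1: last comma-separated field before ", Sri Lanka"
  field <- sub('.*, (.*), Sri Lanka *$', '\\1', addr)
  # Step 2: drop an optional trailing 5-digit postal code
  field <- sub(' \\d{5} *$', '', field)
  # Step 3: drop an optional trailing two-letter uppercase code such as "GL"
  sub(' [[:upper:]]{2}$', '', field)
}

n <- c(
  "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka ",
  "Jungle Beach Road, Buonavista | Rumassala, Unawatuna, Galle 80600, Sri Lanka ",
  "10 Church Street | inside the Fort, Galle, Sri Lanka ",
  "78 Mile Post Matara Road Mihiripenna, Unawatuna, Galle 80615, Sri Lanka ",
  "No: 288 Galle Road | Dadella, Galle 80000, Sri Lanka ",
  "Matara Road, Koggala, Galle, Sri Lanka "
)

cities <- extract_city(n)
print(cities)  # six times "Galle"
```

An address that does not end in ", Sri Lanka" is returned unchanged by step 1, so it may be worth checking the input for such cases first.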

Answer 1 (score: 0)

I would use strsplit instead:

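# Use the full address: the city is the second-to-last comma-separated field
line  <- "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "
parts <- strsplit(line, ",")[[1]]
city  <- parts[length(parts) - 1]   # " Galle GL 80630"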

Give it a try!

To get rid of the numbers and keep only the city, remove them with gsub. Hope it helps!
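A minimal sketch of that gsub step, assuming the full address (including ", Sri Lanka") is split so that the city field is the second-to-last element. The variable names are illustrative, and `trimws` (available in base R since 3.2.0) is an addition here to drop the surrounding spaces:

```r
line  <- "Matara Road, Habaraduwa | Talpe, Unawatuna, Galle GL 80630, Sri Lanka "
parts <- strsplit(line, ",")[[1]]

# Second-to-last field still carries the state code and postal code
field <- parts[length(parts) - 1]

# Remove the digits, then trim leading/trailing whitespace
city <- trimws(gsub("[0-9]+", "", field))
print(city)  # "Galle GL"
```

Note that removing digits alone leaves the "GL" abbreviation in place, so one more substitution would be needed for addresses that contain it.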

Answer 2 (score: 0)

You can write a function that splits the string on commas and takes the second-to-last element, which is usually the city name.

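myfunction <- function(x) {
    parts <- strsplit(x, ",")[[1]]          # split the address on commas
    city  <- parts[length(parts) - 1]       # second-to-last field holds the city
    trimws(gsub("[[:digit:]]", "", city))   # drop digits and surrounding spaces
}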

This function does the job, and it also removes any digits from the result.

Now use it with lapply to get the desired output:

lapply(x,myfunction)
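Since lapply returns a list, a character vector may be more convenient for further processing. One possible variant uses vapply; `city_field` and `addresses` below are hypothetical names, and the sketch assumes every address has at least two comma-separated fields (otherwise vapply stops with an error).

```r
# Hypothetical helper: second-to-last comma-separated field, digits removed
city_field <- function(x) {
  parts <- strsplit(x, ",")[[1]]
  trimws(gsub("[[:digit:]]", "", parts[length(parts) - 1]))
}

addresses <- c(
  "10 Church Street | inside the Fort, Galle, Sri Lanka ",
  "Matara Road, Koggala, Galle, Sri Lanka "
)

# vapply checks that each result is a single string and returns a plain vector
cities <- vapply(addresses, city_field, character(1), USE.NAMES = FALSE)
print(cities)  # "Galle" "Galle"
```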