In R

Date: 2017-10-12 19:10:46

Tags: r regex split tidyr

I am working with a dataset in which one column (Place) consists of free-text location descriptions.

library(tidyverse)

example <- tibble(Datum = c("October 1st 2017", 
                            "October 2st 2017",
                            "October 3rd 2017"),
             Place = c("Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria",
                       "Abu Kamal, Deir Ezzor Governorate, Syria",
                       "شارع القطار al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria"))

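I want to split the Place column on the comma separator, preferably with a tidyverse solution. Because the values of Place contain different numbers of comma-separated parts, I want to fill the new columns from right to left, so that the country (Syria) always ends up in the last column of the data frame.

Oh, and for bonus points: how can I remove the Arabic characters with a regex?

Thanks in advance.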

Edit: I found my answer. Removing the Arabic characters (thanks to @g5w):

gsub("[\u0600-\u06FF]", "", example$Place)

Splitting the column the tidy way:

airstrikes_okt_clean <- separate(example, 
                             Place, 
                             into = c("detail", 
                                      "detail2", 
                                      "City_or_village", 
                                      "District", 
                                      "Country"), 
                             sep = ",", 
                             fill = "left") 
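The two steps above can be combined into one pipeline. The following is only a sketch, not the asker's exact code: it rebuilds the example data, strips the Arabic block first, and uses sep = ",\\s*" (an assumption on my part) so that the pieces do not keep the space that follows each comma. The name cleaned is hypothetical.

```r
library(dplyr)
library(tidyr)

# Same Place values as in the question
example <- tibble(
  Place = c("Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria",
            "Abu Kamal, Deir Ezzor Governorate, Syria",
            "شارع القطار al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria")
)

cleaned <- example %>%
  # Drop characters in the Unicode Arabic block, then trim leftover outer spaces
  mutate(Place = trimws(gsub("[\u0600-\u06FF]", "", Place))) %>%
  # Split on a comma plus any following whitespace; short rows get NA on the left
  separate(Place,
           into = c("detail", "detail2", "City_or_village", "District", "Country"),
           sep = ",\\s*",
           fill = "left")

cleaned$Country
```

With fill = "left", rows that have fewer than five parts are padded with NA in the leftmost columns, so Country is filled for every row.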

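2 Answers:

Answer 0 (score: 1)

Just split the strings on the commas and reverse each result, so that the country comes first. Here Place is the character vector of places, e.g. example$Place.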

 lapply(strsplit(Place, ","), rev)
[[1]]
[1] " Syria"                         " Deir Ezzor Governorate"       
[3] " 20km south east of Deir Ezzor" "Tabiyyah Jazeera village"      

[[2]]
[1] " Syria"                  " Deir Ezzor Governorate"
[3] "Abu Kamal"              

[[3]]
[1] " Syria"                              " Raqqah governorate"                
[3] " north of Raqqah city centre"        " al-Tawassiya area"                 
[5] "شارع القطار al Qitar [train] street"

To remove the Arabic characters before splitting, try:

gsub("[\u0600-\u06FF]", "", Place)
[1] "Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria"              
[2] "Abu Kamal, Deir Ezzor Governorate, Syria"                                                            
[3] "  al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria"
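As the output shows, removing the Arabic characters leaves leading spaces behind, and splitting on "," alone keeps a space in front of most pieces. A possible way to chain both steps in base R while trimming that whitespace is sketched below; the variable names stripped and parts are hypothetical, and the ",\\s*" separator is my assumption rather than part of the answer.

```r
Place <- c("Abu Kamal, Deir Ezzor Governorate, Syria",
           "شارع القطار al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria")

# 1. Remove characters in the Arabic Unicode block and trim the outer whitespace
stripped <- trimws(gsub("[\u0600-\u06FF]", "", Place))

# 2. Split on a comma followed by optional whitespace, then reverse each vector
#    so that the country is always the first element
parts <- lapply(strsplit(stripped, ",\\s*"), rev)

parts[[1]]
```

Each element of parts then starts with the country and ends with the most specific detail, without stray spaces.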

Answer 1 (score: 0)

Here is a one-liner.

sapply(strsplit(example$Place, ","), function(x) trimws(x[length(x)]))

It returns the string after the last comma, whether that is Syria or anything else.
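If more than the last piece is needed, the same idea can be extended. The helper below is a hypothetical sketch (last_parts is not from either answer): it keeps the last n comma-separated pieces of each string and pads shorter ones with NA on the left, which mirrors the right-to-left alignment asked for in the question.

```r
# Hypothetical helper: last n comma-separated pieces per string, right-aligned
last_parts <- function(x, n) {
  pieces <- strsplit(x, ",\\s*")
  rows <- lapply(pieces, function(p) {
    p <- tail(p, n)                               # keep at most the last n pieces
    c(rep(NA_character_, n - length(p)), p)       # pad on the left if too short
  })
  do.call(rbind, rows)                            # one row per input string
}

Place <- c("Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria",
           "Abu Kamal, Deir Ezzor Governorate, Syria",
           "Syria")

m <- last_parts(Place, 2)
m
```

The result is a character matrix with one row per place; its last column holds the country and the one before it the governorate, with NA where a string has too few pieces.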