How do I remove all words except certain ones from a data frame column containing alphanumeric values?

Asked: 2019-06-12 14:44:42

Tags: r

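I have a column in a data frame that contains all of this information, and I want to filter it: in every row of that column, remove every word before "Peracarida", and remove the word "Peracarida" itself as well.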

1.KU189316.1.2308 Eukaryota Opisthokonta Holozoa Metazoa (Animalia) Eumetazoa Bilateria Arthropoda Crustacea Malacostraca Eumalacostraca Peracarida Thermosphaeroma subequalum

2.EU414446.1.2220 Eukaryota Opisthokonta Holozoa Metazoa (Animalia) Eumetazoa Bilateria Arthropoda Crustacea Malacostraca Eumalacostraca Peracarida Betamorpha africana

3.JF699592.1.2323 Eukaryota Opisthokonta Holozoa Metazoa (Animalia) Eumetazoa Bilateria Arthropoda Crustacea Malacostraca Eumalacostraca Peracarida Scutuloidea maculata

The expected result should look like this, but so far I have not figured out how to do it. Thanks in advance for any replies.

1.Thermosphaeroma subequalum

2.Betamorpha africana

3.Scutuloidea maculata

1 Answer:

Answer 0: (score: 0)

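One option is sub: capture the digits followed by a dot (\\d+\\.) at the start (^) of the string, skip everything up to 'Peracarida' and the one or more spaces after it (\\s+), then capture the characters that follow. In the replacement, use the backreferences to the two captured groups (\\1\\2).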

sub("^(\\d+\\.).*\\sPeracarida\\s+(.*)$", "\\1\\2", str1)
#[1] "1.Thermosphaeroma subequalum" "2.Betamorpha africana"
#[3] "3.Scutuloidea maculata"

Note: here we assume the OP also wants to keep the number at the start. If that is not needed, use

sub(".*\\bPeracarida\\s*", "", str1)
#[1] "Thermosphaeroma subequalum" "Betamorpha africana"  
#[3] "Scutuloidea maculata"   
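
One caveat worth keeping in mind: sub returns its input unchanged when the pattern does not match, so a row without 'Peracarida' would silently keep its full lineage. The following is a minimal sketch, assuming NA is the preferred result for such rows. The vector x, its made-up second element, and the helper name after_peracarida are hypothetical and not part of the original question.

```r
# Hypothetical input: the second element is invented and lacks the marker word
x <- c(
  "1.KU189316.1.2308 Eukaryota Malacostraca Peracarida Thermosphaeroma subequalum",
  "2.XX000000.0.0000 Eukaryota Malacostraca Someorder Somegenus somespecies"
)

# Hypothetical helper: keep the text after 'Peracarida', NA when the word is absent
after_peracarida <- function(s) {
  has_marker <- grepl("\\bPeracarida\\b", s)   # which elements contain the word
  out <- sub(".*\\bPeracarida\\s*", "", s)     # strip everything up to and including it
  out[!has_marker] <- NA_character_            # flag non-matching rows instead of keeping them
  out
}

res <- after_peracarida(x)
```

Checking grepl first makes the missing rows explicit, which is easier to audit than comparing the output against the input afterwards.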

Data

str1 <- c("1.KU189316.1.2308 Eukaryota Opisthokonta Holozoa Metazoa (Animalia) Eumetazoa Bilateria Arthropoda Crustacea Malacostraca Eumalacostraca Peracarida Thermosphaeroma subequalum", 
"2.EU414446.1.2220 Eukaryota Opisthokonta Holozoa Metazoa (Animalia) Eumetazoa Bilateria Arthropoda Crustacea Malacostraca Eumalacostraca Peracarida Betamorpha africana", 
"3.JF699592.1.2323 Eukaryota Opisthokonta Holozoa Metazoa (Animalia) Eumetazoa Bilateria Arthropoda Crustacea Malacostraca Eumalacostraca Peracarida Scutuloidea maculata"
)
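
Since the question is about a column in a data frame rather than a bare vector, the same calls can be applied to the column directly. This is only a sketch: the data frame df and the column names taxonomy, species_numbered and species are assumptions, and the lineage strings are shortened for readability.

```r
# Hypothetical data frame; the column name `taxonomy` is an assumption
df <- data.frame(
  taxonomy = c(
    "1.KU189316.1.2308 Eukaryota Malacostraca Eumalacostraca Peracarida Thermosphaeroma subequalum",
    "2.EU414446.1.2220 Eukaryota Malacostraca Eumalacostraca Peracarida Betamorpha africana"
  ),
  stringsAsFactors = FALSE
)

# Keep the leading index plus everything after 'Peracarida'
df$species_numbered <- sub("^(\\d+\\.).*\\sPeracarida\\s+(.*)$", "\\1\\2", df$taxonomy)

# Or drop the index as well
df$species <- sub(".*\\bPeracarida\\s*", "", df$taxonomy)
```

Because sub is vectorised over its third argument, no loop or apply call is needed; each row of the column is processed independently.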