Extracting different words from a character string in R

Time: 2017-08-16 18:43:20

Tags: r string character gsub strsplit

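I have seen several SO posts that seem to come close to answering this question, but I cannot tell whether any of them actually do, so please forgive me if this is a duplicate. I have several dozen character strings (a column in a data frame) that contain different numbers, usually written out as words but sometimes as integers. E.g.:

1 adult, ten neonates nearby

Two adults and six neonates

My end goal is to extract the number of neonates and adults from each string and get something like this:

data.frame(Adults=c(1,1,6), Neonates=c(3,10,6))

However, the number and position of the numbers within the string vary. All the examples I have seen using gsub, strsplit, etc. seem to work only when the pattern used to replace, split, or extract is the same across strings or stays at a constant position within the string. Because I know the numbers must be one of c("one","two",...,"ten"), I could loop over each string, then loop over each possible number to see whether it is present in the string and, if so, extract it and convert it to numeric. But that seems very inefficient.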

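Any help would be greatly appreciated!!

7 answers:

Answer 0: (score: 0)

One potential approach uses str_split from the stringr package and a custom function that wraps the match-finding and post-processing. The dataset size was not mentioned, so I cannot test or comment on speed.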

library(stringr) #for str_split

customFun = function(
strObj="Three neonates with one adult",
rootOne = "adult",
rootTwo = "neonate"){

#split string
discreteStr = str_split(strObj,pattern = "\\s+",simplify = TRUE)



#find indices of root words
rootOneIndex = grep(rootOne,discreteStr)
rootTwoIndex = grep(rootTwo,discreteStr)

#mapping vectors
charVec = c("one","two","three","four","five","six","seven","eight","nine","ten")
numVec = as.character(1:10)
names(numVec) = charVec

#match index neighbourhood ,-1/+1  and select first match
rootOneMatches = tolower(discreteStr[c(rootOneIndex-1,rootOneIndex+1)])
rootOneMatches = rootOneMatches[!is.na(rootOneMatches)]
rootOneMatches = head(rootOneMatches,1)


rootTwoMatches = tolower(discreteStr[c(rootTwoIndex-1,rootTwoIndex+1)])
rootTwoMatches = rootTwoMatches[!is.na(rootTwoMatches)]
rootTwoMatches = head(rootTwoMatches,1)

#check presence in mapping vectors
rootOneNum = intersect(rootOneMatches,c(charVec,numVec))
rootTwoNum = intersect(rootTwoMatches,c(charVec,numVec))

#final matches and numeric conversion
rootOneFinal = ifelse(!is.na(as.numeric(rootOneNum)),as.numeric(rootOneNum),as.numeric(numVec[rootOneNum]))
rootTwoFinal = ifelse(!is.na(as.numeric(rootTwoNum)),as.numeric(rootTwoNum),as.numeric(numVec[rootTwoNum]))

outDF = data.frame(strObj = strObj,adults = rootOneFinal,neonates = rootTwoFinal,stringsAsFactors=FALSE) 
return(outDF)
}

Output:

inputVec = c("Three neonates with one adult","1 adult, ten neonates nearby","Two adults and six neonates")
outputAggDF = suppressWarnings(do.call(rbind,lapply(inputVec,customFun)))

outputAggDF
#                         strObj adults neonates
#1 Three neonates with one adult      1        3
#2  1 adult, ten neonates nearby      1       10
#3   Two adults and six neonates      2        6
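For comparison, the same "word next to the root" idea can be sketched with a single capture group in base R. This is only an illustration, not the answer's method: word_before is a hypothetical helper name, and it assumes the number always sits directly before the root word.

```r
# Hypothetical helper (not part of the answer above): return the token that
# directly precedes a root word such as "adult" or "neonate".
word_before <- function(x, root) {
  # One capture group: the word right before the root (case-insensitive)
  pattern <- paste0("(\\w+)\\s+", root)
  m <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  # regmatches() yields character(0) when nothing matches; map that to NA
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

inputVec <- c("Three neonates with one adult",
              "1 adult, ten neonates nearby",
              "Two adults and six neonates")

word_before(inputVec, "adult")    # "one" "1" "Two"
word_before(inputVec, "neonate")  # "Three" "ten" "six"
```

The extracted tokens are still character strings, so they would need the same word-to-number mapping step as in customFun.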

Answer 1: (score: 0)

I was able to get the final result, but I admit my code is not pretty.

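Answer 2: (score: 0)

Others were a bit faster, but here is a slightly different approach in case you are interested.

As I see it, the main problem is replacing the "one", "two", etc. strings: they are tedious to type out, and handling high numbers this way is not feasible.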

strings <- c("Three neonates with one adult",
"1 adult, ten neonates nearby",
"Two adults and six neonates")

numbers <- c("one","two","three","four","five","six","seven","eight","nine","ten")

splitted <- unlist(strsplit(strings, split="[[:blank:][:punct:]]+"))

ind_neon <- which((splitted == "neonates") | (splitted == "neonate"))
ind_adul <- which((splitted == "adults") | (splitted == "adult"))

neon <- tolower(splitted[ind_neon-1])
adul <- tolower(splitted[ind_adul-1])

neon2 <- as.numeric(neon)
neon2[is.na(neon2)] <- as.numeric(factor(neon[is.na(neon2)],
               levels=numbers,
               labels=(1:10)))

adul2 <- as.numeric(adul)
adul2[is.na(adul2)] <- as.numeric(factor(adul[is.na(adul2)],
                levels=numbers,
                labels=(1:10)))

adul2
# [1] 1 1 2
neon2
# [1]  3 10  6
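On the point that typing out number words becomes tedious for higher numbers, one possible workaround is to generate the lookup table instead of writing it by hand. The sketch below builds the words for 1 to 99 in base R. The names units, teens, tens and lookup are made up for illustration, and it assumes compound numbers are hyphenated ("forty-two"), which the splitting step above would have to preserve.

```r
# Building blocks for English number words (illustrative names)
units <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
           "sixteen", "seventeen", "eighteen", "nineteen")
tens  <- c("twenty", "thirty", "forty", "fifty", "sixty", "seventy",
           "eighty", "ninety")

# outer() pairs every tens word with every suffix ("" or "-one" ... "-nine");
# the values are built with the same layout so names and numbers line up
compound_names  <- as.vector(outer(tens, c("", paste0("-", units)), paste0))
compound_values <- as.vector(outer(seq(20, 90, by = 10), 0:9, `+`))

lookup <- c(setNames(1:9, units),
            setNames(10:19, teens),
            setNames(compound_values, compound_names))

tokens <- c("Forty-Two", "7", "nineteen")
value <- suppressWarnings(as.numeric(tokens))   # digits convert directly
value[is.na(value)] <- lookup[tolower(tokens[is.na(value)])]
value
# [1] 42  7 19
```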

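Answer 3: (score: 0)

There are surely more efficient options, but this does the trick, and it can handle more numbers if you add them to the pattern vector.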

library(stringr)
library(qdap)
library(tidyr)

Bring in the data

 v <- tolower(c("Three neonates with one adult",
           "1 adult, ten neonates nearby",
           "Two adults and six neonates"))

Assign the word and number vectors for the pattern

words<- c("one","two","three","four","five","six","seven","eight","nine","ten")
nums <- seq(1, 10)
pattern <- c(words, nums)

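Extract all the numbers and types. The number pattern is wrapped in word boundaries so that the "one" inside "neonates" is not picked up as a number.

found_nums <- unlist(str_extract_all(v, paste0("\\b(", paste(pattern, collapse="|"), ")\\b")))
found_types <- unlist(str_extract_all(v, "neonate|adult"))

Use the multiple gsub (mgsub) from qdap to replace all written-out numbers with their corresponding integers, then paste each number to its type

found_nums <- mgsub(words, nums, found_nums)
w <- paste(found_nums, found_types)
w <- do.call(rbind.data.frame, strsplit(w, " "))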
names(w) <- c("numbers", "name")

Generate a row id so that the data can be spread.

w$row <- rep(1:(nrow(w)/2), each=2)
spread(w, name, numbers)[-c(1)]


#    adult neonate
#  1     1       3
#  2     1      10
#  3     2       6

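Answer 4: (score: 0)

strapply from the gsubfn package allows the words to be extracted, as shown below. I could not find any built-in function to convert words to numbers or vice versa, but there may be pre-built functions created by other users.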

> library(gsubfn)
> df <- data.frame(Text = c("Three neonates with one adult","1 adult, ten neonates nearby","Two adults and six neonates"))
> df
                           Text
1 Three neonates with one adult
2  1 adult, ten neonates nearby
3   Two adults and six neonates

> for(i in 1:nrow(df)){
+     
+     df$Adults[i] <- strapply(as.character(df$Text[i]), "(\\w+) adult*")
+     df$Neonates[i] <- strapply(as.character(df$Text[i]), "(\\w+) neonate*")
+     
+ }

> df
                           Text Adults Neonates
1 Three neonates with one adult    one    Three
2  1 adult, ten neonates nearby      1      ten
3   Two adults and six neonates    Two      six
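Since the result above still holds words, a final conversion step is needed. A minimal base R sketch of that step follows, assuming counts never exceed ten. word_values and to_count are hypothetical names, not functions from gsubfn or any other package.

```r
# Named lookup vector: the name is the written word, the value is the count
word_values <- c(one = 1L, two = 2L, three = 3L, four = 4L, five = 5L,
                 six = 6L, seven = 7L, eight = 8L, nine = 9L, ten = 10L)

# Hypothetical helper: accepts digits ("1") or words ("Two"), returns integers
to_count <- function(x) {
  x <- tolower(as.character(x))
  out <- unname(word_values[x])        # NA wherever x is not a known word
  digits <- grepl("^[0-9]+$", x)
  out[digits] <- as.integer(x[digits]) # plain digits convert directly
  out
}

to_count(c("one", "1", "Two"))      # 1 1 2
to_count(c("Three", "ten", "six"))  # 3 10 6
```

If a column holds a list, as it may after assigning strapply results element by element, it would have to be flattened first, for example with to_count(unlist(df$Adults)).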

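Answer 5: (score: 0)

Here is a simple answer that uses only base R, without any fancy packages ;-)

If you only ever have 1 to 10 neonates/adults, and if they always appear in your strings as X adult(s) and Y neonate(s) (i.e., the number comes before the category), then it is quite simple: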

df = data.frame(strings = c("Three neonates with one adult",
                            "1 adult, ten neonates nearby",
                            "Two adults and six neonates"))

littnums = c('one', 'two', 'three', 'four', 'five', 
             'six', 'seven', 'eight', 'nine', 'ten')
nums = 1:10

getnums = function(mystring, mypattern) {
  # split your string at all spaces
  mysplitstring = unlist(strsplit(mystring, split=' '))
  # The number you are looking for is before the pattern
  numBeforePattern = mysplitstring[grep(mypattern, mysplitstring) - 1]
  # Then convert it to a integer or, if it fails, translate it 
  ifelse(is.na(suppressWarnings(as.integer(numBeforePattern))), 
         nums[grep(tolower(numBeforePattern), littnums)], 
         as.integer(numBeforePattern))
}

df$Neonates = sapply(as.vector(df$strings), FUN=getnums, 'neonate')
df$Adults = sapply(as.vector(df$strings), FUN=getnums, 'adult')
df
#                         strings Neonates Adults
# 1 Three neonates with one adult        3      1
# 2  1 adult, ten neonates nearby       10      1
# 3   Two adults and six neonates        6      2
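One caveat: when a string does not mention a category at all, grep returns integer(0) and getnums yields a zero-length result, so sapply returns a list rather than a vector. A guarded variant might look like the sketch below. getnums_safe is a hypothetical name, and it keeps the answer's assumption that the number comes right before the category.

```r
littnums <- c("one", "two", "three", "four", "five",
              "six", "seven", "eight", "nine", "ten")

# Hypothetical variant of getnums that returns NA instead of a zero-length result
getnums_safe <- function(mystring, mypattern) {
  tokens <- unlist(strsplit(mystring, split = " "))
  hit <- grep(mypattern, tokens)
  # Category absent, or it is the first token: there is no preceding word
  if (length(hit) == 0 || hit[1] == 1) return(NA_integer_)
  candidate <- tolower(tokens[hit[1] - 1])
  if (grepl("^[0-9]+$", candidate)) return(as.integer(candidate))
  # match() gives the position in littnums (which equals the value), or NA
  match(candidate, littnums)
}

strings <- c("Two adults and six neonates", "six neonates only")
unname(vapply(strings, getnums_safe, integer(1), mypattern = "adult"))
# [1]  2 NA
unname(vapply(strings, getnums_safe, integer(1), mypattern = "neonate"))
# [1] 6 6
```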

Answer 6: (score: 0)

Here is another approach

Your data

S <- c("Three neonates with one adult", "1 adult, ten neonates nearby", "Two adults and six neonates")

A dplyr and stringr approach

library(stringr)
library(dplyr)

searchfor <- c("neonates", "adult")         
words <- str_extract_all(S, boundary("word"))   # keep only words

The next statement grabs the word that comes before each of the searchfor words and saves the result as a data.frame

chrnum <- as.data.frame(Reduce(cbind, lapply(searchfor, function(y) lapply(words, function(x) x[which(x %in% y)-1]))))

The next statement uses str_replace_all with a named vector and converts the result to numeric

replaced <- chrnum %>% 
              mutate_all(funs(as.numeric(str_replace_all(tolower(.), c("one" = "1", "two" = "2", "three" = "3", "four" = "4", "five" = "5", "six" = "6", "seven" = "7", "eight" = "8", "nine" = "9", "ten" = "10"))))) %>%
              setNames(searchfor)

Note that you will get a warning about NAs introduced by coercion

Output

  neonates adult
1        3     1
2       10     1
3        6    NA
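The NA in the third row appears to come from the exact comparison with %in%: the third string contains "adults", which is not equal to "adult". One way around it is to match on the word stem instead. The following is a base R sketch under that assumption, not a drop-in change to the dplyr pipeline; before is an illustrative name.

```r
S <- c("Three neonates with one adult",
       "1 adult, ten neonates nearby",
       "Two adults and six neonates")

searchfor <- c("neonate", "adult")   # stems, so singular and plural both match

# Lower-case and split on anything that is not a letter or digit
words <- strsplit(tolower(S), "[^[:alnum:]]+")

# For each stem, take the word just before the first word starting with it
before <- sapply(searchfor, function(root)
  vapply(words, function(w) {
    i <- which(startsWith(w, root))
    if (length(i) == 0 || i[1] == 1) NA_character_ else w[i[1] - 1]
  }, character(1)))

before[, "adult"]    # "one" "1" "two"
before[, "neonate"]  # "three" "ten" "six"
```

The resulting character matrix can then go through the same word-to-number replacement as above.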