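I have text data (in R) and want to replace certain characters with other characters taken from a data frame. I thought this would be an easy task: strsplit on spaces to create a vector, use matching (%in%) on it, and then paste it back together. But then I thought about punctuation: there is no space between the last word of a sentence and the punctuation mark that ends it.
I figure there is probably a simpler way to achieve what I want than the convoluted mess my code is turning into. I'd appreciate a pointer in the right direction.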
#Character String
x <- "I like 346 ice cream cones. They're 99 percent good! I ate 46."
#Replacement Values Dataframe
symbol text
1 "346" "three hundred forty six"
2 "99" "ninety nine"
3 "46" "forty six"
#replacement dataframe
numDF <-
data.frame(symbol = c("346","99", "46"),
text = c("three hundred forty six", "ninety nine","forty six"),
stringsAsFactors = FALSE)
Desired result:
[1] "I like three hundred forty six ice cream cones. They're ninety nine percent good! I ate forty six."
EDIT: I originally titled this "conditional gsub", because that is how I think of it, even though no gsub is involved.
Answer 0 (score: 8)
Inspired by Josh O'Brien's answer, maybe something like this:
x <- "I like 346 ice cream cones. They're 99 percent good! I ate 46."
numDF <- structure(c("346", "99", "46", "three hundred forty six", "ninety nine",
"forty six"), .Dim = c(3L, 2L), .Dimnames = list(c("1", "2",
"3"), c("symbol", "text")))
pat <- paste(numDF[,"symbol"], collapse="|")
repeat {
m <- regexpr(pat, x)
if(m==-1) break
sym <- regmatches(x,m)
regmatches(x,m) <- numDF[match(sym, numDF[,"symbol"]), "text"]
}
x
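One caveat with a plain alternation such as "346|99|46" is that it also matches inside longer numbers. The following is only a sketch of a single-pass variant that anchors each symbol at word boundaries. It assumes numDF is the data frame from the question (not the matrix used above), and x2 and pat_b are made-up names for illustration.

```r
# Replacement table as defined in the question
numDF <- data.frame(symbol = c("346", "99", "46"),
                    text = c("three hundred forty six", "ninety nine", "forty six"),
                    stringsAsFactors = FALSE)

# Hypothetical input containing a longer number that includes "346"
x2 <- "Order 1346 has 46 items and 99 boxes."

# \\b anchors each symbol at word boundaries, so "1346" is left alone
pat_b <- paste0("\\b(", paste(numDF$symbol, collapse = "|"), ")\\b")

m <- gregexpr(pat_b, x2, perl = TRUE)
sym <- regmatches(x2, m)   # list with one character vector per element of x2

# For gregexpr matches the replacement value must be a list
regmatches(x2, m) <- lapply(sym, function(s) numDF$text[match(s, numDF$symbol)])
x2
```

Because gregexpr finds all matches at once, no repeat loop is needed here.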
Answer 1 (score: 6)
This solution uses gsubfn from the package of the same name:
library(gsubfn)
(pat <- paste(numDF$symbol, collapse="|"))
# [1] "346|99|46"
gsubfn(pattern = pat,
replacement = function(x) {
numDF$text[match(x, numDF$symbol)]
},
x)
[1] "I like three hundred forty six ice cream cones. They're ninety nine percent good! I ate forty six."
Answer 2 (score: 4)
You can split on whitespace or on word boundaries (which match between a word and punctuation):
> x
[1] "I like 346 ice cream cones. They're 99 percent good! I ate 46."
> strsplit(x, split='\\s|\\>|\\<')
[[1]]
[1] "I" "like" "346" "ice" "cream" "cones" "."
[8] "" "They" "'re" "99" "percent" "good" "!"
[15] "" "I" "ate" "46" "."
Then you can do the replacement.
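A possible way to finish the split, match and paste idea is sketched below. Note that it swaps in a different tokenisation from the strsplit call above: regmatches with gregexpr cuts the string into digit runs and non-digit runs, so spaces and punctuation are kept and the pieces can be pasted back without a separator. The names tokens, idx, hit and out are made up for illustration.

```r
x <- "I like 346 ice cream cones. They're 99 percent good! I ate 46."
numDF <- data.frame(symbol = c("346", "99", "46"),
                    text = c("three hundred forty six", "ninety nine", "forty six"),
                    stringsAsFactors = FALSE)

# Tokenise into alternating runs of digits and non-digits (nothing is dropped)
tokens <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))[[1]]

# Look up each token in the replacement table; NA means "leave as is"
idx <- match(tokens, numDF$symbol)
hit <- !is.na(idx)
tokens[hit] <- numDF$text[idx[hit]]

# Paste the pieces back together without adding separators
out <- paste(tokens, collapse = "")
out
```

Since whole digit runs are compared with match, "46" is never replaced inside "346".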
Answer 3 (score: 3)
Another solution, using Reduce from base.
list_df <- apply(numDF, 1, as.list)
Reduce(function(x, l) gsub(l$symbol, l$text, x), list_df, init = x)
EDIT. Here is a complete solution that uses the numbers2words function directly.
list_df <- as.numeric(regmatches(x, gregexpr('[0-9]+', x))[[1]])
Reduce(function(x, l) gsub(l, numbers2words(l), x), list_df, init = x)
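The sequential gsub calls above depend on the order of the replacements: if "46" were processed before "346", the longer number would come out as "3forty six". A minimal sketch of one way to guard against that, assuming the symbols are literal strings, is to process the longest symbols first. The names ord and res are made up for illustration.

```r
x <- "I like 346 ice cream cones. They're 99 percent good! I ate 46."

# Same replacements as in the question, deliberately listed with "46" first
numDF <- data.frame(symbol = c("46", "99", "346"),
                    text = c("forty six", "ninety nine", "three hundred forty six"),
                    stringsAsFactors = FALSE)

# Process the longest symbols first so "346" is handled before "46"
ord <- order(nchar(numDF$symbol), decreasing = TRUE)

# fixed = TRUE treats each symbol as a literal string, not a regex
res <- Reduce(function(s, i) gsub(numDF$symbol[i], numDF$text[i], s, fixed = TRUE),
              ord, init = x)
res
```

Iterating over row indices instead of a list of rows keeps the lookup in the data frame itself.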
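Answer 4 (score: 2)
It's not clear whether you actually want to convert the numbers to their alphabetic equivalents. If so, here is a more general strategy. There are (at least) two number-to-text conversion functions in the rhelp archives: Jim Lemon's digits2text and John Fox's numbers2words. I also switched to gregexpr to get a vectorized approach:
Cutting and pasting Lemon's function from the HTML found here runs out of the box: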
> m <- gregexpr("[0-9]+", x)
> sym <- regmatches(x,m)
> regmatches(x,m) <- digits2text(as.numeric(sym[[1]]))
illion = 0
digilen = 3
digitext = three hundred forty six
[1] 6 4 3
>
> x
[1] "I like three hundred forty six ice cream cones. They're three hundred forty six percent good! I ate three hundred forty six."
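I had to edit the numbers2words function a little, because some missing line breaks broke the parsing (the working version is included below this demonstration):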
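> m <- gregexpr("[0-9]+", x)
> sym <- regmatches(x,m)
> # for gregexpr matches, value must be a list with one character vector per element of x;
> # a bare character vector would reuse its first element for every match
> regmatches(x,m) <- list(numbers2words(as.numeric(sym[[1]])))
>
> x
[1] "I like three hundred forty six ice cream cones. They're ninety nine percent good! I ate forty six."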
Fox's function, edited from http://tolstoy.newcastle.edu.au/R/help/05/04/2715.html:
numbers2words <- function(x){
    # convert a single non-negative whole number to words
    helper <- function(x){
        digits <- rev(strsplit(as.character(x), "")[[1]])
        nDigits <- length(digits)
        if (nDigits == 1) as.vector(ones[digits])
        else if (nDigits == 2)
            if (x <= 19) as.vector(teens[digits[1]])
            else trim(paste(tens[digits[2]],
                            Recall(as.numeric(digits[1]))))
        else if (nDigits == 3) trim(paste(ones[digits[3]], "hundred",
                                          Recall(makeNumber(digits[2:1]))))
        else {
            nSuffix <- ((nDigits + 2) %/% 3) - 1
            if (nSuffix > length(suffixes)) stop(paste(x, "is too large!"))
            trim(paste(Recall(makeNumber(digits[
                                  nDigits:(3*nSuffix + 1)])),
                       suffixes[nSuffix],
                       Recall(makeNumber(digits[(3*nSuffix):1]))))
        }
    }
    # strip one leading space and any trailing spaces
    # ("\ " is not a valid escape in an R string, so plain spaces are used)
    trim <- function(text){
        gsub("^ ", "", gsub(" *$", "", text))
    }
    makeNumber <- function(...) as.numeric(paste(..., collapse=""))
    # avoid scientific notation in as.character(); restore options on exit
    opts <- options(scipen=100)
    on.exit(options(opts))
    ones <- c("", "one", "two", "three", "four", "five", "six", "seven",
              "eight", "nine")
    names(ones) <- 0:9
    teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
               "sixteen", "seventeen", "eighteen", "nineteen")
    names(teens) <- 0:9
    tens <- c("twenty", "thirty", "forty", "fifty", "sixty",
              "seventy", "eighty", "ninety")
    names(tens) <- 2:9
    x <- round(x)
    suffixes <- c("thousand", "million", "billion", "trillion")
    if (length(x) > 1) return(sapply(x, helper))
    helper(x)
}