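I have a data.table called base, and it has a character column named terme:
class(base$terme)
[1] "character"
length(base$terme)
[1] 27486
I can remove accents from a single string:
iconv("Millésime",to="ASCII//TRANSLIT")
[1] "Millesime"
I can also remove accents from a vector of strings:
iconv(c("Millésime","boulangère"),to="ASCII//TRANSLIT")
[1] "Millesime" "boulangere"
But for some reason, when I apply the same function to my terme column, it does not work: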
base$terme[2]
[1] "Millésime"
iconv(base$terme[2],to="ASCII//TRANSLIT")
[1] "MillACsime"
Does anyone know what is going on here?
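Answer 0 (score: 12)
OK, here is how to solve the problem: check the encoding R has declared for the string, then pass it to iconv explicitly through the from argument: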
Encoding(base$terme[2])
[1] "UTF-8"
iconv(base$terme[2],from="UTF-8",to="ASCII//TRANSLIT")
[1] "Millesime"
Thanks to @nicola.
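A likely explanation for the "MillACsime" output: when from is omitted, iconv assumes the input is in the current locale's encoding. If that locale is not UTF-8, the two bytes that encode "é" in UTF-8 are read as two separate characters and transliterated one by one. The sketch below shows one way to inspect the declared encoding and the raw bytes, then convert with an explicit from. The helper name strip_accents_iconv is hypothetical, and the exact transliteration iconv produces can vary by platform.

```r
# "\u00e9"-style escapes guarantee that R marks these strings as UTF-8
x <- c("Mill\u00e9sime", "boulang\u00e8re")

print(Encoding(x))      # declared encoding of each element
print(charToRaw(x[1]))  # raw bytes; "é" shows up as the two bytes c3 a9

# Hypothetical helper: normalise to UTF-8 first, then transliterate to ASCII
strip_accents_iconv <- function(x) {
  x <- enc2utf8(x)
  iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
}

out <- strip_accents_iconv(x)
print(out)
```

On a data.table column, the same helper could be applied in place with base[, terme := strip_accents_iconv(terme)].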
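Answer 1 (score: 2)
Using the stringi package may be easier. That way you do not need to check the encoding beforehand. In addition, stringi behaves consistently across operating systems, whereas iconv does not.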
library(data.table)  # needed for data.table() and :=
library(stringi)
base <- data.table(terme = c("Millésime",
"boulangère",
"üéâäàåçêëèïîì"))
base[, terme := stri_trans_general(str = terme,
id = "Latin-ASCII")]
> base
terme
1: Millesime
2: boulangere
3: ueaaaaceeeiii
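When several columns hold accented text, the same stringi call can be applied to all character columns at once. The following is a minimal sketch under assumed data: the table dt and its columns id, terme and ville are made up for illustration and are not part of the original question.

```r
library(data.table)
library(stringi)

# Hypothetical example table; "\u00e9"-style escapes keep the source ASCII-only
dt <- data.table(
  id    = 1:2,
  terme = c("Mill\u00e9sime", "boulang\u00e8re"),
  ville = c("Montr\u00e9al", "Qu\u00e9bec")
)

# Pick out the character columns, then transliterate each one in place
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (chr_cols) := lapply(.SD, stri_trans_general, id = "Latin-ASCII"),
   .SDcols = chr_cols]

print(dt)
```

Non-character columns such as id are left untouched because they are excluded from .SDcols.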
Answer 2 (score: 2)
Three ways to remove accents, shown and compared below.
Data used:
library(data.table)  # fread() and :=
library(magrittr)    # %>% used in the results section
dtCases <- fread("https://github.com/ishaberry/Covid19Canada/raw/master/cases.csv", stringsAsFactors = FALSE)
dim(dtCases) # 751526 16
Benchmark:
> system.time(dtCases [, city0 := health_region])
user system elapsed
0.009 0.001 0.012
> system.time(dtCases [, city1 := base::iconv (health_region, to="ASCII//TRANSLIT")]) # or ... iconv (health_region, from="UTF-8", to="ASCII//TRANSLIT")
user system elapsed
0.165 0.001 0.200
> system.time(dtCases [, city2 := textclean::replace_non_ascii (health_region)])
user system elapsed
9.108 0.063 9.351
> system.time(dtCases [, city3 := stringi::stri_trans_general (health_region,id = "Latin-ASCII")])
user system elapsed
4.34 0.00 4.46
Results:
> dtCases[city0!=city1, city0:city3] %>% unique
                           city0                         city1                         city2                         city3
                          <char>                        <char>                        <char>                        <char>
1:                      Montréal                      Montreal                      Montreal                      Montreal
2:                    Montérégie                    Monteregie                    Monteregie                    Monteregie
3:          Chaudière-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches          Chaudiere-Appalaches
4:                    Lanaudière                    Lanaudiere                    Lanaudiere                    Lanaudiere
5:                Nord-du-Québec                Nord-du-Quebec                Nord-du-Quebec                Nord-du-Quebec
6:         Abitibi-Témiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue         Abitibi-Temiscamingue
7: Gaspésie-Îles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine Gaspesie-Iles-de-la-Madeleine
8:                     Côte-Nord                     Cote-Nord                     Cote-Nord                     Cote-Nord
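Conclusion:
base::iconv() is the fastest of the three on this data (0.200 s elapsed, versus 4.46 s for stringi::stri_trans_general() and 9.351 s for textclean::replace_non_ascii()) and is the preferred method here. All three produce the same results on the rows shown above.
Tested on French words only; not tested on other languages.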
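The comparison above depends on downloading a remote CSV. As a rough, self-contained alternative, the sketch below times iconv and stringi on a small synthetic vector and checks how often the two agree. The vector x and the variable names are assumptions made for illustration; timings will differ by machine, and iconv's transliteration may differ by platform, so no particular numbers are claimed.

```r
library(stringi)

# Synthetic input: three region names repeated many times (UTF-8 via escapes)
x <- rep(c("Montr\u00e9al", "C\u00f4te-Nord", "Toronto"), times = 1e4)

# Time each approach while keeping its result
t_iconv   <- system.time(r_iconv   <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT"))
t_stringi <- system.time(r_stringi <- stri_trans_general(x, id = "Latin-ASCII"))

# Elapsed seconds for each method
elapsed <- c(iconv = t_iconv[["elapsed"]], stringi = t_stringi[["elapsed"]])
print(elapsed)

# Share of elements on which the two methods give the same string
share_agree <- mean(!is.na(r_iconv) & r_iconv == r_stringi)
print(share_agree)
```

If share_agree is below 1, inspecting x[r_iconv != r_stringi] shows where the platform's iconv transliterates differently from stringi.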
Answer 3 (score: -1)
You can apply this function:
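# Remove accents from a character vector.
# pattern = "all" strips every accent type; otherwise pass one or more of the
# marks listed in accentTypes below to strip only those accent types.
rm_accent <- function(str, pattern = "all") {
  if (!is.character(str))
    str <- as.character(str)
  pattern <- unique(pattern)
  # Treat an upper-case cedilla request like the lower-case one
  if (any(pattern == "Ç"))
    pattern[pattern == "Ç"] <- "ç"
  # Accented characters, grouped by accent type
  symbols <- c(
    acute = "áéíóúÁÉÍÓÚýÝ",
    grave = "àèìòùÀÈÌÒÙ",
    circunflex = "âêîôûÂÊÎÔÛ",
    tilde = "ãõÃÕñÑ",
    umlaut = "äëïöüÄËÏÖÜÿ",
    cedil = "çÇ"
  )
  # Unaccented replacements, matched to symbols position by position
  nudeSymbols <- c(
    acute = "aeiouAEIOUyY",
    grave = "aeiouAEIOU",
    circunflex = "aeiouAEIOU",
    tilde = "aoAOnN",
    umlaut = "aeiouAEIOUy",
    cedil = "cC"
  )
  # Accent marks accepted in pattern, in the same order as symbols
  accentTypes <- c("´", "`", "^", "~", "¨", "ç")
  # Option: remove all accents ("todos" is Portuguese for "all")
  if (any(c("all", "al", "a", "todos", "t", "to", "tod", "todo") %in% pattern))
    return(chartr(paste(symbols, collapse = ""), paste(nudeSymbols, collapse = ""), str))
  # Otherwise remove only the requested accent types
  for (i in which(accentTypes %in% pattern))
    str <- chartr(symbols[i], nudeSymbols[i], str)
  return(str)
}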
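The function above is built on chartr, which swaps each character in one string for the character at the same position in another. A stripped-down sketch of that idea, using made-up input, also shows its main limitation: any accented character missing from the lookup string is left as it is.

```r
# Input with two French words and one Danish word (UTF-8 via escapes)
x <- c("Mill\u00e9sime", "boulang\u00e8re", "sm\u00f8rrebr\u00f8d")

old <- "\u00e9\u00e8"  # the characters to replace: e-acute, e-grave
new <- "ee"            # their replacements, position by position

out <- chartr(old, new, x)
print(out)

# The o-with-stroke is not in the lookup string, so the third element is unchanged
print(out[3] == x[3])
```

This is why a hand-written table needs extending for each new language, whereas iconv or stringi rely on transliteration rules maintained elsewhere.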