This question is related to a previous one on how to replace accented strings such as México with the equivalent LaTeX code M\'{e}xico.
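My problem is slightly different. I am using a third-party database with string variables like the one above. However, the encoding seems odd, because this is the behavior I get: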
> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"
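where temp$dest_nom_ent is the variable holding state names, México among them.
So my question is how to convert the string variable from the third-party database to an encoding that standard R functions will recognize. Note that: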
> Encoding(temp$dest_nom_ent)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"
For the record, I am using Windows 7 64-bit. Also note that:
> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f
which, according to this source, is consistent with the Windows Spanish (Traditional Sort) locale:
M=4d
é=e9
x=78
i=69
c=63
o=6f
Also note:
> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"
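The two byte dumps above can be reproduced without any database. This is only a minimal sketch: the names s_utf8 and s_latin1 are made up for illustration, and the string is written with a Unicode escape so that it does not depend on the keyboard or console encoding.

```r
# "México" written with a Unicode escape, so R stores it as UTF-8
s_utf8 <- "M\u00e9xico"

# The same text re-encoded as latin1 (ISO-8859-1)
s_latin1 <- iconv(s_utf8, from = "UTF-8", to = "latin1")

# In UTF-8 the accented e takes two bytes (c3 a9)
print(charToRaw(s_utf8))    # 4d c3 a9 78 69 63 6f

# In latin1 it takes a single byte (e9), as in the database value
print(charToRaw(s_latin1))  # 4d e9 78 69 63 6f
```

So the database bytes look like a single-byte encoding of é, while the string typed at the console holds the two-byte UTF-8 form.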
I tried the following without success (meaning that, for example, grep("é",temp$dest_nom_ent) still returns an empty vector):
Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent <- enc2utf8(temp$dest_nom_ent)
...
I checked the supported character sets with iconvlist(), and "WINDOWS-1252" is supported. However, the following does not work:
> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)
Compare that with:
> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3
I also tried to find the encoding by brute force, like this:
try(for(i in 1:length(iconvlist())){
temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
Encoding(temp1)<-iconvlist()[i]
temp1 <- iconv(temp1,iconvlist()[i],"latin1")
print(grep("é",temp1))
print(i)
},silent=FALSE)
I am not familiar with the try function, but it still stops at the error instead of ignoring it, so I cannot check the whole list:
...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") :
unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
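The loop above stops because try() wraps the whole for loop, so the first error ends everything. One way around it, sketched here under assumptions, is to catch the error inside each iteration. The helper name try_encodings and the short candidate list are hypothetical; in practice the candidates could come from iconvlist(). The input is rebuilt from the bytes shown earlier rather than read from the database.

```r
# Bytes observed in the question for temp$dest_nom_ent[18]
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
target <- "M\u00e9xico"  # "México" written with a Unicode escape

# Hypothetical helper: return the candidate encodings under which
# the bytes of x decode to the target string.
try_encodings <- function(x, target, candidates) {
  hits <- character(0)
  for (enc in candidates) {
    converted <- tryCatch(
      iconv(x, from = enc, to = "UTF-8"),
      error = function(e) NA_character_  # unsupported conversion: skip it
    )
    # iconv() also returns NA when the bytes are invalid in 'enc'
    if (!is.na(converted) &&
        identical(enc2utf8(converted), enc2utf8(target))) {
      hits <- c(hits, enc)
    }
  }
  hits
}

# "no-such-encoding" is a deliberately bogus name to exercise the error path
hits <- try_encodings(x, target, c("latin1", "no-such-encoding", "UTF-8"))
print(hits)
```

Because the error handler sits inside the loop, an unsupported name such as 'CP-GR' only skips that one candidate.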
Finally:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2
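The seven characters printed by the loop above are what you get when the UTF-8 bytes of "México" are labeled as latin1. The following is only a sketch of that effect and does not claim to be what the console actually did; the variable names are made up for illustration.

```r
# UTF-8 bytes of "México": 4d c3 a9 78 69 63 6f
utf8_bytes <- charToRaw(enc2utf8("M\u00e9xico"))

# Rebuild a string from those bytes and (wrongly) declare it latin1
s <- rawToChar(utf8_bytes)
Encoding(s) <- "latin1"

# Each byte now counts as one character, so the length is 7, not 6
n <- nchar(s)

# Converted to UTF-8 for display, c3 and a9 become two separate characters
shown <- enc2utf8(s)
print(n)
print(shown)
```

Under that reading, grep("é",d) matches because the pattern typed at the same console would carry the same two bytes.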
So it seems I have to change the computer's locale, as suggested here. See also here.
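PS: In case you are wondering how I managed to type d<-c("México","México") under the English_United States.1252 locale: go to Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards and, under installed services, set up a secondary Spanish (Traditional Sort) keyboard by clicking Add and navigating to Spanish Traditional Sort. Then, under advanced key settings, you can create a shortcut to switch keyboards. In my case it is Shift+Alt. So if I want to type ñ in the default English locale, I press Shift+Alt followed by ;, and then Shift+Alt again to return to the English keyboard.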
Answer 0 (score: 1)
Use Encoding(x) to see what encoding is declared for temp$dest_nom_ent and for "México". You may need to convert with enc2native or enc2utf8.
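As a rough sketch of that suggestion, assuming the database bytes really are latin1 (which the 4d e9 78 69 63 6f dump in the question is consistent with): Encoding<- only declares an encoding and leaves the bytes alone, while enc2utf8 actually re-encodes a marked string. The variable names below are illustrative, and the input is rebuilt from the bytes shown in the question.

```r
# A string with the bytes from the question and no declared encoding
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Declare (not convert) the encoding: the bytes stay the same
Encoding(x) <- "latin1"
bytes_after_marking <- charToRaw(x)

# Convert the marked string to UTF-8: the bytes change to 4d c3 a9 ...
y <- enc2utf8(x)
bytes_after_converting <- charToRaw(y)

# Search with a pattern written as a Unicode escape
found <- grepl("\u00e9", y, fixed = TRUE)
print(found)
```

Writing the pattern as "\u00e9" avoids relying on how the console encodes a typed é.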
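Answer 1 (score: 0)
Try setting the encoding of the strings to one of "ISO_8859-1" or "ISO_8859-15".
Two more suggestions, and then I give up: "UTF-16" and "UTF-16LE". The second is UTF little-endian, I believe, and I have read that it is what Windows 7 actually uses. You might as well try "UTF-16BE" too. (Material taken from another stackexchange posting: https://superuser.com/questions/221593/windows-7-utf-8-and-unicode)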
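Answer 2 (score: 0)
Well, I could not pin down the encoding of the accents, but the following achieves what I wanted. The trick is to convert to UTF-8, call sub() with the option useBytes=TRUE, and, following Joran's suggestion, pass sanitize.text.function=function(x){x} when printing the xtable() result. Here is sample code. It is easy to loop over all the accented vowels: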
> temp1 <- unique(temp$dest_nom_ent)
> temp1
[1] "Aguascalientes" "Baja California"
[3] "Baja California Sur" "Campeche"
[5] "Coahuila de Zaragoza" "Colima"
[7] "Chiapas" "Guanajuato"
[9] "Guerrero" "Hidalgo"
[11] "Jalisco" "México"
[13] "Michoacán de Ocampo" "Morelos"
[15] "Nayarit" "Oaxaca"
[17] "Puebla" "Querétaro"
[19] "Quintana Roo" "San Luis Potosí"
[21] "Sinaloa" "Tabasco"
[23] "Tlaxcala" "Veracruz de Ignacio de la Llave"
[25] "Zacatecas"
> temp1 <- iconv(unique(temp1),"","UTF-8")
> temp1
[1] "Aguascalientes" "Baja California"
[3] "Baja California Sur" "Campeche"
[5] "Coahuila de Zaragoza" "Colima"
[7] "Chiapas" "Guanajuato"
[9] "Guerrero" "Hidalgo"
[11] "Jalisco" "México"
[13] "Michoacán de Ocampo" "Morelos"
[15] "Nayarit" "Oaxaca"
[17] "Puebla" "Querétaro"
[19] "Quintana Roo" "San Luis Potosí"
[21] "Sinaloa" "Tabasco"
[23] "Tlaxcala" "Veracruz de Ignacio de la Llave"
[25] "Zacatecas"
> Encoding(temp1)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "UTF-8" "UTF-8" "unknown"
[15] "unknown" "unknown" "unknown" "UTF-8" "unknown" "UTF-8" "unknown"
[22] "unknown" "unknown" "unknown" "unknown"
> temp2 <- sub("é", "\\\\'{e}", temp1, useBytes = TRUE)
> temp2 <- data.frame(temp2)
> print(xtable(temp2),sanitize.text.function=function(x){x})
% latex table generated in R 2.13.1 by xtable 1.5-6 package
% Fri Jul 15 13:52:44 2011
\begin{table}[ht]
\begin{center}
\begin{tabular}{rl}
\hline
& temp2 \\
\hline
1 & Aguascalientes \\
2 & Baja California \\
3 & Baja California Sur \\
4 & Campeche \\
5 & Coahuila de Zaragoza \\
6 & Colima \\
7 & Chiapas \\
8 & Guanajuato \\
9 & Guerrero \\
10 & Hidalgo \\
11 & Jalisco \\
12 & M\'{e}xico \\
13 & Michoacán de Ocampo \\
14 & Morelos \\
15 & Nayarit \\
16 & Oaxaca \\
17 & Puebla \\
18 & Quer\'{e}taro \\
19 & Quintana Roo \\
20 & San Luis Potosí \\
21 & Sinaloa \\
22 & Tabasco \\
23 & Tlaxcala \\
24 & Veracruz de Ignacio de la Llave \\
25 & Zacatecas \\
\hline
\end{tabular}
\end{center}
\end{table}
The actual implementation, in a loop:
temp$dest_nom_ent <- iconv(
temp$dest_nom_ent,"","UTF-8")
temp$dest_nom_mun <- iconv(
temp$dest_nom_mun,"","UTF-8")
accents <-c("á","é","í","ó","ú")
latex <-c("\\\\'{a}","\\\\'{e}","\\\\'{i}","\\\\'{o}","\\\\'{u}")
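for(i in 1:5){
# sub() replaces only the first match in each string
temp$dest_nom_ent<-sub(accents[i], latex[i],
temp$dest_nom_ent, useBytes = TRUE)
# read from dest_nom_mun here, not dest_nom_ent
temp$dest_nom_mun<-sub(accents[i], latex[i],
temp$dest_nom_mun, useBytes = TRUE)
}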
capture.output(
print(xtable(temp),sanitize.text.function=function(x){x}),
file = "../paper/rTables.tex", append = FALSE)
However, this answer is incomplete, because I cannot explain exactly what is going on. I found it by trial and error.
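A variant of the loop above, offered only as a sketch and not as what the answer actually ran: it swaps sub(..., useBytes = TRUE) for gsub(..., fixed = TRUE), so every occurrence is replaced and the replacement string is taken literally (one escaped backslash is enough). The function name accents_to_latex is hypothetical, and the accented letters are written as Unicode escapes so the code does not depend on the console encoding. It assumes the input is already valid UTF-8 or correctly marked.

```r
# Hypothetical helper: replace accented lowercase vowels with LaTeX macros
accents_to_latex <- function(x) {
  x <- enc2utf8(x)
  accents <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa")  # á é í ó ú
  latex   <- c("\\'{a}", "\\'{e}", "\\'{i}", "\\'{o}", "\\'{u}")
  for (i in seq_along(accents)) {
    # fixed = TRUE: literal pattern and literal replacement, all matches
    x <- gsub(accents[i], latex[i], x, fixed = TRUE)
  }
  x
}

states <- c("M\u00e9xico", "Michoac\u00e1n de Ocampo",
            "Quer\u00e9taro", "San Luis Potos\u00ed", "Colima")
out <- accents_to_latex(states)
cat(out, sep = "\n")
```

The result can then be passed to xtable() with sanitize.text.function=function(x){x}, as in the code above, so the backslashes are not escaped again.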