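Does anyone know of a generic function in R that can convert an HTML character entity such as &auml; to its Unicode character, ä? I have seen some functions that take ä and convert it to a plain character. Any help would be much appreciated. Thanks.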
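Edit: Below is a sample of the data; I may have over 1 million records. Is there a simpler solution than reading the data into a massive vector and changing the record for each element?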
wine/name: 1999 Domaine Robert Chevillon Nuits St. Georges 1er Cru Les Vaucrains
wine/wineId: 43163
wine/variant: Pinot Noir
wine/year: 1999
review/points: N/A
review/time: 1337385600
review/userId: 1
review/userName: Eric
review/text: Well this is awfully gorgeous, especially with a nicely grilled piece of Copper River sockeye. Pine needle and piercing perfume move to a remarkably energetic and youthful palate of pure, twangy, red fruit. Beneath that is a fair amount of umami and savory aspect with a surprising amount of tannin. Lots of goodness here. Still quite young but already rewarding at this stage.
wine/name: 2001 Karth&auml;userhof Eitelsbacher Karth&auml;userhofberg Riesling Sp&auml;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
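Update:
Using stringi::stri_trans_general() with the "Latin-ASCII" transform converts accented characters such as ä to their plain ASCII equivalents (here, a). The vapply() result has to be assigned for the changes to be kept.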
# cellartracker-10records.txt is the test file to use
library(XML)
tester <- "/Users/petergensler/Desktop/Wine Analysis/cellartracker-10records.txt"
# Parse the file at path x as HTML and return the decoded text of its first <p> node
decode <- function(x) { xmlValue(getNodeSet(htmlParse(x), "//p")[[1]]) }
# vapply() iterates over the file path(s); assign the result, otherwise it is discarded
poop <- vapply(tester, decode, character(1), USE.NAMES = FALSE)
# Now use stringi to transliterate the remaining non-ASCII characters to ASCII
poop <- stringi::stri_trans_general(poop, "Latin-ASCII")
writeLines(poop, "wines.txt")
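The transliteration step in the update is separate from entity decoding, and it is lossy. A minimal sketch of what the "Latin-ASCII" transform does to an already decoded string, assuming the stringi package is installed (the sample string is made up from the wine names above):

```r
library(stringi)

# "\u00e4" is the character ä, written as an escape so the example
# does not depend on the encoding of the script file
x <- "Karth\u00e4userhof Sp\u00e4tlese"

# Latin-ASCII replaces accented Latin letters with their closest ASCII letters
ascii <- stri_trans_general(x, "Latin-ASCII")
print(ascii)
```

If the goal is to keep the real Unicode characters, this step can simply be skipped: once the umlaut has been replaced by a plain "a" it cannot be recovered from the output.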
Answer 0 (score: 3)
Here is one way, via the XML package:
txt <- "wine/name: 2003 Karth&auml;userhof Eitelsbacher Karth&auml;userhofberg Riesling Kabinett"
library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
The [[1]] bit is needed because getNodeSet() returns a list of parsed elements, even if there is only one element, as here.
This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.
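To see why the [[1]] is there, the call can be split into steps and the intermediate result inspected. This is only an illustrative sketch: the names doc and nodes are my own, and the shortened sample string is an assumption based on the example above.

```r
library(XML)

txt <- "wine/name: 2003 Karth&auml;userhof Riesling Kabinett"

# The HTML parser wraps the bare text in a <p> element, which is what "//p" selects
doc <- htmlParse(txt, asText = TRUE)
nodes <- getNodeSet(doc, "//p")

# getNodeSet() returns a list-like node set, so one node has to be extracted
# before xmlValue() can return its decoded text
print(class(nodes))
print(length(nodes))
print(xmlValue(nodes[[1]]))
```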
If you want to run the decoding on a character vector of length > 1, then lapply() over it:
txt <- rep(txt, 2)
decode <- function(x) {
xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)
or, if you want the result as a vector, use vapply():
> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
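With real data, some elements may be empty or may not parse into a <p> node at all, in which case the [[1]] in decode() would stop the whole vapply() call. A defensive variant could look like the sketch below. Note that decode_safe is a hypothetical helper name, not part of the XML package, and falling back to the original string is my assumption about the behaviour you would want.

```r
library(XML)

# decode_safe() is a hypothetical name, not a function from the XML package
decode_safe <- function(x) {
  tryCatch({
    nodes <- getNodeSet(htmlParse(x, asText = TRUE), "//p")
    # No <p> node found: return the input unchanged instead of failing on [[1]]
    if (length(nodes) == 0) x else xmlValue(nodes[[1]])
  }, error = function(e) x)  # parser error: also fall back to the input
}

# One record with an entity and one empty record
records <- c("Riesling Sp&auml;tlese", "")
decoded <- vapply(records, decode_safe, character(1), USE.NAMES = FALSE)
print(decoded)
```

Because vapply() is told to expect character(1) from every call, it would also flag any element for which the helper returned something other than a single string.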
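For the multi-line example, use the original version, but you will have to write the character vector back out to a file if you want it as a multi-line document again: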
txt <- "wine/name: 2001 Karth&auml;userhof Eitelsbacher Karth&auml;userhofberg
Riesling Sp&auml;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"
out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
which gives me
> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"
If you write this out using writeLines()
writeLines(out, "wines.txt")
you will get a text file that can be read in again using your other parsing code:
> readLines("wines.txt")
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
[2] "Riesling Spätlese"
[3] "wine/wineId: 3058"
[4] "wine/variant: Riesling"
[5] "wine/year: 2001"
[6] "review/points: N/A"
[7] "review/time: 1095120000"
[8] "review/userId: 1"
[9] "review/userName: Eric"
[10] "review/text: Hideously corked!"
And it is a file (from my BASH terminal):
$ cat wines.txt
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
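For a file with a million records, one possible approach is to stream it in chunks and decode only the lines that contain an ampersand, since a line without "&" cannot hold an entity. The sketch below is an assumption-laden illustration rather than a tested solution for the real data: decode_file and chunk_size are hypothetical names, it assumes a UTF-8 locale when writing, and it assumes the review text contains no markup such as stray "<" characters that the HTML parser would interpret.

```r
library(XML)

decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}

# Hypothetical helper: copy infile to outfile, decoding chunk_size lines at a time
decode_file <- function(infile, outfile, chunk_size = 10000L) {
  con_in <- file(infile, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)

  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0) break  # end of file reached

    # Only lines containing "&" can hold an entity, so skip the rest
    has_amp <- grepl("&", lines, fixed = TRUE)
    if (any(has_amp)) {
      lines[has_amp] <- vapply(lines[has_amp], decode, character(1),
                               USE.NAMES = FALSE)
    }
    writeLines(lines, con_out)
  }
  invisible(outfile)
}
```

A call would look like decode_file("cellartracker-10records.txt", "wines.txt"). Memory use is bounded by chunk_size rather than by the size of the file, and most lines (ids, years, timestamps) are never passed to the parser.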