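I have a vector of strings downloaded from the web that I need to pass to XML in order to build a .gexf file with the rgexf package.
I have identified the problematic string, but after several attempts (see below) I still cannot figure out how to sanitize it with regular expressions. It is in the nodes data frame below.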
library(rgexf)
nodes <- data.frame(matrix(c("1","one",
"2","two",
"3","three",
"4","C//DTD XHTML 1.0 Transitional//EN\"\n \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\" id=\"sixapart-standard\">\n<head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8"),
ncol=2,byrow=T))
edges <- data.frame(matrix(c("1","2",
"3","4",
"4","3",
"2","1"),
ncol=2,byrow=T))
# My attempts to sanitize the string
nodes[,2] <- gsub("<","",nodes[,2])
nodes[,2] <- gsub(">","",nodes[,2])
nodes[,2] <- gsub("&quot;","\"",nodes[,2])
nodes[,2] <- gsub("=\"","",nodes[,2])
nodes[,2] <- gsub("EN\"\n","",nodes[,2])
write.gexf(nodes=nodes, edges=edges, output="test.gexf")
The error message from the XML builder is:
attributes construct error
Couldn't find end of Start Tag node line 1
Error: 1: attributes construct error
2: Couldn't find end of Start Tag node line 1
Answer 0 (score: 4)
You can try using the XML package to escape the string properly:
library(XML)
string = "C//DTD XHTML 1.0 Transitional//EN\"\n \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\" id=\"sixapart-standard\">\n<head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8"
as.character(XML::xmlTextNode(string))[6]
# [1] "C//DTD XHTML 1.0 Transitional//EN&quot;\n &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;\n&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; xml:lang=&quot;en&quot; lang=&quot;en&quot; id=&quot;sixapart-standard&quot;&gt;\n&lt;head&gt;\n &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=utf-8"
Edit: Specifically,
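sanitizeForXml <- function (string) {
  # [6] picks the escaped text out of the node, as in the example above;
  # return it directly instead of assigning it to a local variable
  as.character(XML::xmlTextNode(string))[6]
}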
vector <- vapply(vector, sanitizeForXml, FUN.VALUE = character(1), USE.NAMES = FALSE)
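If you would rather not depend on the XML package, the same escaping can be sketched in base R. This is only a minimal sketch: escape_xml is a hypothetical helper name, not part of rgexf or XML, and it simply replaces the five predefined XML entities by hand. The ampersand is handled first so that the entities produced by the later steps are not escaped a second time.

```r
# Hypothetical helper (not from rgexf or XML): escape the five
# predefined XML entities using base R only.
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)    # must run first
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub("\"", "&quot;", x, fixed = TRUE)
  x <- gsub("'", "&apos;", x, fixed = TRUE)
  x
}

# Made-up labels for illustration; the second one resembles the
# problematic string from the question.
labels <- c("one", "<html xmlns=\"http://www.w3.org/1999/xhtml\">", "a & b")
escaped <- escape_xml(labels)
print(escaped)
```

With the data from the question this would be applied as nodes[,2] <- escape_xml(as.character(nodes[,2])) before calling write.gexf. If your version of rgexf already escapes labels itself, this would double-escape them, so it is worth inspecting the generated file.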