Normalizing and cleaning text in R

Time: 2016-12-20 09:05:05

Tags: r excel

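I need help cleaning Excel files in R.

These are Excel files made by different people, and they are supposed to contain exactly the same text. My task is to compare snippets of the text and make sure they match (sometimes people typed it, sometimes they copied and pasted it, and the result is a mess).

My particular problem is that there is no standard format, and some of the text was extracted from PDFs.

To give you an idea, the text might look like this: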

File A:

"The monkey must eat:

· Bananas, or

· 6 bananas."

File B:

"The monkey must eat:

Bananas, or

5 bananas."

File C:

"The monkey must eat:

· Bananas, or

· 6 bananas."

So far I have used a combination of the following functions, but in the end my comparison still returns FALSE:

monkeyr$txtcp <- stri_enc_toascii(monkeyr$txtcp)

monkeyr$txtcp <- removeNumbers(monkeyr$txtcp) 
#bad idea as I want to compare the number of bananas

monkeyr$txtcp <- tolower(monkeyr$txtcp)

monkeyr$txtcp <- stripWhitespace(monkeyr$txtcp)

monkeyr$txtcp <- removePunctuation(monkeyr$txtcp)

monkeyr$txtcp <- trimws(monkeyr$txtcp)

monkeyr$txtcp <- stri_replace_all_charclass(monkeyr$txtcp,
 "\t", " ", merge = T)

#This above was specially because I wanted to remove the "tab" in File 3.
#Does not work. This is some sort of "invisible" tab that gets turned into
#a series of ->->-> when saved as csv.

Additional information:

This is what File C looks like after being stripped and opened in Excel:

Screenshot of "invisible" tabs, which turn into arrows
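One way to find out what an "invisible" character really is, before choosing a pattern to remove it, is to print the Unicode code points in the string. The sketch below uses only base R. The sample string is an assumption made up for illustration (a private-use bullet, U+F0B7, followed by a no-break space, U+00A0); it is not taken from the actual files.

```r
# Hypothetical sample: a Symbol-font bullet (U+F0B7) followed by a
# no-break space (U+00A0), as often found in text pasted from Word or a PDF.
x <- "\uf0b7\u00a0Bananas"

# Integer code point of every character in the string.
cp <- utf8ToInt(x)

# Keep only the non-ASCII code points and format them as U+XXXX.
suspects <- unique(cp[cp > 127])
sprintf("U+%04X", suspects)
```

Anything reported here that is not ordinary text (for example a no-break space instead of a real tab) will not be matched by a pattern such as "\t", which may be why the replacement above has no effect.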

Any suggestions for normalizing the text?

Caveat: no packages that depend on Java.

Thanks in advance.

Edit

Sample input:

monkeyr <- structure(list(id = c("MON1", "MON2", "MON3"), txtcp = c("The monkey must be fed a combination of:\r\n<U+F0B7>     Bananas, or\r\n<U+F0B7>     6 Bananas.", 
"The monkey must:\r\n·   Be active\r\n·   Be petted\r\n·   Be inactive.", 
"The monkey must:\r\nbe tame\r\njump")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("id", 
"txtcp"))

Expected output:

cleanmonkey <- structure(list(id = c("MON1", "MON2", "MON3"), txtcp = c("the monkey must be fed a combination of bananas or 6 bananas", 
"the monkey must be active be petted be inactive", "the monkey must be tame jump"
)), .Names = c("id", "txtcp"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L))

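1 answer:

Answer 0 (score: 1)

Nothing elegant, but how about this? The code removes the non-ASCII placeholders, then "\r" and "\n", then punctuation, and finally cleans up the extra whitespace.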

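library(stringr)

# Drop the "<U+XXXX>" placeholders left behind for non-ASCII characters
monkeyr$clean <- str_replace_all(string = monkeyr$txtcp, pattern = "<U.*>", replacement = "")
# Delete carriage returns and line feeds (nothing is put in their place)
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\\r", replacement = "")
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\\n", replacement = "")
# Strip punctuation, then delete each pair of consecutive whitespace characters
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "[[:punct:]]", replacement = "")
monkeyr$clean <- str_replace_all(string = monkeyr$clean, pattern = "\\s{2}", replacement = "")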

monkeyr$clean
[1] "The monkey must be fed a combination of Bananas or 6 Bananas"
[2] "The monkey must Be active Be petted Be inactive"             
[3] "The monkey mustbe tamejump" 

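Note that words end up joined as "mustbe" and "tamejump" in the third row: the line breaks are deleted rather than replaced with a space, and that row has no indentation after them.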
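A possible way around the joined words is to turn every run of non-alphanumeric characters (line breaks, bullets, punctuation, tabs, spaces) into a single space instead of deleting it, and to lowercase at the end. The sketch below is an assumption-laden variation, not the answer's method: it uses stringi, which has no Java dependency, and `normalize_text` is a hypothetical helper name. The sample strings are the ones from the question's example input, with the middle dot written as `\u00b7` to keep the code encoding-safe.

```r
library(stringi)

# Hypothetical helper, not part of any package.
normalize_text <- function(x) {
  # Remove literal "<U+XXXX>" escape text, as seen in the sample input.
  x <- stri_replace_all_regex(x, "<U\\+[0-9A-Fa-f]{4,6}>", " ")
  # Collapse every run of characters that are neither letters nor digits
  # (line breaks, bullets, punctuation, any kind of space) into one space.
  x <- stri_replace_all_regex(x, "[^\\p{L}\\p{N}]+", " ")
  # Trim the ends and lowercase so that only the words and numbers remain.
  stri_trans_tolower(stri_trim_both(x))
}

txt <- c(
  "The monkey must be fed a combination of:\r\n<U+F0B7>     Bananas, or\r\n<U+F0B7>     6 Bananas.",
  "The monkey must:\r\n\u00b7   Be active\r\n\u00b7   Be petted\r\n\u00b7   Be inactive.",
  "The monkey must:\r\nbe tame\r\njump"
)

expected <- c(
  "the monkey must be fed a combination of bananas or 6 bananas",
  "the monkey must be active be petted be inactive",
  "the monkey must be tame jump"
)

clean <- normalize_text(txt)
identical(clean, expected)
```

Because digits are kept, "5 bananas" and "6 bananas" still compare as different, which is the point of the comparison. After this step two cells can be compared with `==` or `identical()`.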