I have a really stubborn problem... I can't seem to remove the <U+F0B7> and <U+F0A0> strings from a corpus loaded into R from *.txt files:
UPDATE: here is a link to a sample .txt file: https://db.tt/qTRKpJYK
Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain))
title
professional staff - contract - permanent position
software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl
accomplishments
<U+F0B7>
<U+F0A0>
responsible maintaining billing system interfaced cellular switching system <U+F0B7>
<U+F0A0>
developed unix interface ibm mainframe ericsson motorola att cellular switches
I have tried adding them to my stop word list:
badWords <- unique(c(stopwords("en"),
                     stopwords("SMART")[stopwords("SMART") != "c"],
                     as.character(1970:2050),
                     "<U+F0B7>", "<+f0b7>",
                     "<U+F0A0>", "<+f0a0>",
                     "january", "jan",
                     "february", "feb",
                     "march", "mar",
                     "april", "apr",
                     "may",
                     "june", "jun",
                     "july", "jul",
                     "august", "aug",
                     "september", "sep",
                     "october", "oct",
                     "november", "nov",
                     "december", "dec"))
using:
tm_map(candidates.Corpus, removeWords, badWords)
but somehow that doesn't work. I have also tried repeating it with something like gsub("<+f0a0>", "", tmp, perl = FALSE), and that works on strings created inside R, but when I read in the .txt files the characters still show up.
Is there something special about these characters? How do I get rid of them?
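For reference, a quick base-R way to confirm which code points a string actually contains (the literal below is just a stand-in for a line read from one of the .txt files):

```r
# Stand-in for a line read from one of the .txt files;
# utf8ToInt() converts a string to its Unicode code points.
line <- "Crazy \uf0b7 Character"
sprintf("U+%04X", utf8ToInt(line))
# "U+F0B7" shows up among the ASCII code points, confirming the
# stray character really is present in the R string.
```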
Answer 0 (score: 1)
OK. The problem is that you have an unusual Unicode character in your data. In R we would normally escape this character as "\uf0b7", but when inspect() prints the data it encodes it as "<U+F0B7>". Observe:
sample <- c("Crazy \uf0b7 Character")
cp <- Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))
# A document-term matrix (1 documents, 3 terms)
#
# Non-/sparse entries: 3/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs <U+F0B7> character crazy
# 1 1 1 1
(I actually had to create this output on a Windows machine running R 3.0.2; it worked fine on a Mac running R 3.1.0.)
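This display behavior also explains why adding the literal strings "<U+F0B7>" or "<+f0b7>" to badWords can never match anything: the printed <U+F0B7> is only R's escape for the code point, not eight actual characters in the text. A quick check:

```r
# "<U+F0B7>" here is eight ASCII characters, while "\uf0b7" is a
# single Unicode character, so the two strings are not equal.
identical("<U+F0B7>", "\uf0b7")
# [1] FALSE
nchar("\uf0b7")
# [1] 1
```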
Unfortunately you can't just removeWords it, because the regular expression used in that function requires a word boundary on each side of the "word", and this does not seem to be a character that is recognized as forming a boundary. See:
gsub("\uf0b7", "", sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b", "", sample)
# [1] "Crazy <U+F0B7> Character"
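By contrast, \b works as expected around ordinary word characters, which is why removeWords normally behaves fine:

```r
# \b matches between a word character and a non-word character,
# so it finds both edges of "Crazy" here and the word is removed:
gsub("\\bCrazy\\b", "", "Crazy Character")
# [1] " Character"
```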
So we can write our own function that we can use with tm_map. Consider:
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}
which is basically just the removeWords function without the boundary conditions. Then we can run:
cp2 <- tm_map(cp, removeCharacters, c("\uf0b7","\uf0a0"))
inspect(DocumentTermMatrix(cp2))
# A document-term matrix (1 documents, 2 terms)
#
# Non-/sparse entries: 2/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs character crazy
# 1 1 1
and we see that those Unicode characters are no longer there.
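As a quick sanity check, the same function also works on a plain character vector, independently of tm:

```r
# Same helper as above: strips the given characters with no
# word-boundary requirement; (*UCP) makes PCRE treat the pattern
# with Unicode character properties.
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}
removeCharacters("Crazy \uf0b7\uf0a0 Character", c("\uf0b7", "\uf0a0"))
# [1] "Crazy  Character"
```

Note that on more recent versions of tm (0.6 and later), a custom function like this generally needs to be wrapped when passed to tm_map, e.g. tm_map(cp, content_transformer(removeCharacters), c("\uf0b7", "\uf0a0")).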