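I've been working on text analysis of data. Typically the analysis involves coding transcripts on paper and then importing that information into R as numeric codes. I'd like to output a transcript of the words with the word numbers above them, wrapped to a certain line width (let's use an arbitrary 80 characters).
A minimal visual example: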
#what we start with:
   person   text word.num
1    greg    The        1
2    greg    dog        2
3    greg   went        3
4    greg     to        4
5    greg    the        5
6    greg   zoo,        6
7    greg    but        7
8    greg    ate        8
9    greg first.        9
10  sally     He       10
11  sally  likes       11
12  sally  water       12
13  sally      a       13
14  sally    bit       14
15  sally   too.       15
#what I'd like:
1   2   3    4  5   6
The dog went to the zoo,
7   8   9      10 11
but ate first. He likes
12    13 14  15
water a  bit too.
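Another problem arises when the numbers get large: a word number can be wider than a short word, so the word needs extra space(s) added to keep the columns lined up. I think this would be easy to do in the pasting step by finding the number of characters (digits) in the largest number and adding that many spaces after words that are shorter than this.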
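A minimal sketch of that padding idea in base R. The data and the helper name `pad_right` are made up for illustration; the only assumption is that each column should be at least as wide as the widest word number.

```r
# Illustrative data only: three short words with four-digit word numbers
words <- c("a", "bit", "too.")
nums  <- as.character(c(1013, 1014, 1015))

# Number of characters (digits) in the widest word number
max.digits <- max(nchar(nums))

# Hypothetical helper: append spaces until each element of x is w characters wide
pad_right <- function(x, w) paste0(x, strrep(" ", pmax(w - nchar(x), 0)))

# Every column is at least as wide as the widest number
w <- pmax(nchar(words), max.digits)

num.line  <- paste(pad_right(nums, w), collapse = " ")
word.line <- paste(pad_right(words, w), collapse = " ")
cat(num.line, word.line, sep = "\n")
```

Both lines come out the same length, so each number starts directly above its word.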
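My thoughts on solving this so far (with the functions I expect to need at each step):
- strwrap may be useful here
- nchar and gsub may be useful here
- cumsum and seq may be useful here
- Generate a matrix of the numbers (actually characters) that is 1 column wide, so that its rows match up with the character (word) matrix
- nchar may be useful here
I'd like to keep this in base tools. I'm sure Hadley's stringr would be useful, but I want to avoid that dependency.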
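One way these pieces could fit together, as a rough base-R sketch rather than a finished solution. The function name `number_wrap` and the placeholder trick are assumptions of this sketch: `strwrap` picks the line breaks on space-free placeholders that have the same widths as the padded columns, and the token counts per line then say which words go on which line.

```r
words <- c("The", "dog", "went", "to", "the", "zoo,", "but", "ate",
           "first.", "He", "likes", "water", "a", "bit", "too.")

# Hypothetical helper: number the words and wrap both rows to `width`
number_wrap <- function(words, width = 80) {
  words <- as.character(words)
  nums  <- as.character(seq_along(words))
  # Each column is as wide as the longer of the word and its number
  w   <- pmax(nchar(words), nchar(nums))
  pad <- function(x) paste0(x, strrep(" ", w - nchar(x)))
  # Let strwrap() choose the breaks on placeholders of the same widths
  placeholder <- paste(strrep("x", w), collapse = " ")
  wrapped     <- strwrap(placeholder, width = width)
  per.line    <- lengths(strsplit(wrapped, " ", fixed = TRUE))
  # Line index for every word, e.g. 1 1 1 2 2 ...
  line.id <- rep(seq_along(per.line), per.line)
  num.lines  <- tapply(pad(nums),  line.id, paste, collapse = " ")
  word.lines <- tapply(pad(words), line.id, paste, collapse = " ")
  # Interleave: a numbers line followed by its words line
  trimws(as.vector(rbind(num.lines, word.lines)), which = "right")
}

out <- number_wrap(words, width = 30)
writeLines(out)
```

The result is a plain character vector, so it can go to the console or to a file with `writeLines(out, "foo.txt")`.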
dput of the data:
dat <- structure(list(person = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("greg", "sally"), class = "factor"),
text = structure(c(10L, 5L, 14L, 11L, 9L, 15L, 4L, 2L, 6L,
7L, 8L, 13L, 1L, 3L, 12L), .Label = c("a", "ate", "bit",
"but", "dog", "first.", "He", "likes", "the", "The", "to",
"too.", "water", "went", "zoo,"), class = "factor"), word.num = 1:15), row.names = c(NA,
-15L), .Names = c("person", "text", "word.num"), class = "data.frame")
I couldn't come up with a title that I feel would be searchable for future SO users. Please suggest edits...
Answer 0 (score: 3)
> datmat <- matrix(c(1:length(dat$text), as.character(dat$text) ), nrow=2, byrow=TRUE)
> datmat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[2,] "The" "dog" "went" "to" "the" "zoo," "but" "ate" "first." "He" "likes" "water" "a" "bit" "too."
> options(width=30)
> datmat
[,1] [,2] [,3] [,4]
[1,] "1" "2" "3" "4"
[2,] "The" "dog" "went" "to"
[,5] [,6] [,7] [,8]
[1,] "5" "6" "7" "8"
[2,] "the" "zoo," "but" "ate"
[,9] [,10] [,11]
[1,] "9" "10" "11"
[2,] "first." "He" "likes"
[,12] [,13] [,14]
[1,] "12" "13" "14"
[2,] "water" "a" "bit"
[,15]
[1,] "15"
[2,] "too."
The quotes can be removed by coercing to a table object, so that print.table is used:
> class(datmat) <- "table"
> datmat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] The dog went to the
[,6] [,7] [,8] [,9]
[1,] 6 7 8 9
[2,] zoo, but ate first.
[,10] [,11] [,12] [,13]
[1,] 10 11 12 13
[2,] He likes water a
[,14] [,15]
[1,] 14 15
[2,] bit too.
You could also do something like this. It addresses the left-alignment issue that Gavin mentioned:
> gsub("\\[.*\\,.*\\]", "", capture.output( print(datmat, quote=FALSE) ) )
[1] " "
[2] " 1 2 3 4 5 "
[3] " The dog went to the "
[4] " "
[5] " 6 7 8 9 "
[6] " zoo, but ate first."
[7] " "
[8] " 10 11 12 13 "
[9] " He likes water a "
[10] " "
[11] " 14 15 "
[12] " bit too. "
And one further refinement:
datlines <- gsub("\\[.*\\,.*\\]", "", capture.output( print(datmat, quote=FALSE) ) )
for( i in seq_along(datlines)){ cat(datlines[i], "\n") }
#----------------------------------#
1 2 3 4 5
The dog went to the
6 7 8 9
zoo, but ate first.
10 11 12 13
He likes water a
14 15
bit too.
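As a possible variation on the loop above, `writeLines` prints one element per line without the trailing space that `cat(x, "\n")` adds, and the blank lines left over from the column headers can be filtered out first. This is only a sketch on a shortened, illustrative word list, and the exact column spacing depends on how R prints the matrix.

```r
# Illustrative subset of the words
words  <- c("The", "dog", "went", "to", "the", "zoo,")
datmat <- matrix(c(seq_along(words), words), nrow = 2, byrow = TRUE)

# Narrow the console width only while capturing the printed matrix
old <- options(width = 30)
datlines <- gsub("\\[.*\\,.*\\]", "",
                 capture.output(print(datmat, quote = FALSE)))
options(old)

# Drop the lines that held only the [,1] [,2] ... column headers
datlines <- datlines[nzchar(trimws(datlines))]
writeLines(datlines)
```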
Answer 1 (score: 3)
How about:
> tmp <- setNames(as.character(dat$text), dat$word.num)
> print(tmp, quote=FALSE)
1 2 3 4
likes water a bit too.
> options(width = 80)
> print(tmp, quote=FALSE)
1 2 3 4 5 6 7 8 9 10 11
The dog went to the zoo, but ate first. He likes
12 13 14 15
water a bit too.
You can put your own class on the object and add a print method for it:
class(tmp) <- "foo"
print.foo <- function(x, quote = FALSE, ...) {
print(unclass(x), quote = quote, ...)
}
tmp
which gives
> tmp
1 2 3 4 5 6 7 8 9 10 11
The dog went to the zoo, but ate first. He likes
12 13 14 15
water a bit too.
One way to dump this representation to a file is via capture.output(), which has a file argument:
capture.output(tmp, file = "foo.txt")
The resulting text file contains:
$ cat foo.txt
1 2 3 4 5 6 7 8 9 10 11
The dog went to the zoo, but ate first. He likes
12 13 14 15
water a bit too.
This isn't exactly what you have, since the word numbers are right-aligned, but it's close.
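Answer 2 (score: 1)
To round out the thread, here is what I ended up using: DWin's solution combined with some of Gavin's approach, wrapped in a function: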
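numbtext <- function(text.var, width=80, txt.file = NULL) {
    # Row 1 holds the word numbers, row 2 the words themselves
    zz <- matrix(c(seq_along(text.var), as.character(text.var)),
        nrow=2, byrow=TRUE)
    # Set the print width for this call; restore the old value on exit,
    # even if printing fails
    OW <- options(width=width)
    on.exit(options(OW))
    # Empty dimnames suppress the [1,] and [,1] labels
    dimnames(zz) <- list(rep("", nrow(zz)), rep("", ncol(zz)))
    print(zz, quote = FALSE)
    if (!is.null(txt.file)){
        # Append the same output to a text file
        sink(file=txt.file, append = TRUE)
        print(zz, quote = FALSE)
        sink()
    }
}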
numbtext(dat$text, 40, "foo.txt")
which yields:
1 2 3 4 5 6 7 8
The dog went to the zoo, but ate
9 10 11 12 13 14 15
first. He likes water a bit too.