Lining up and alternating strings

Time: 2012-11-06 17:42:13

Tags: r text

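I have been working on text analysis of data. The analysis often involves coding transcripts on paper and then importing the information into R as numeric codes. I would like to output a transcript of the words with each word's number above it, wrapped to a given line width (let's use an arbitrary 80 characters).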

A minimal visual example:

#what we start with:

   person   text word.num
1    greg    The        1
2    greg    dog        2
3    greg   went        3
4    greg     to        4
5    greg    the        5
6    greg   zoo,        6
7    greg    but        7
8    greg    ate        8
9    greg first.        9
10  sally     He       10
11  sally  likes       11
12  sally  water       12
13  sally      a       13
14  sally    bit       14
15  sally   too.       15

#what I'd like:

1   2   3    4  5   6
The dog went to the zoo, 

7   8   9      10 11     
but ate first. He likes   

12    13  14  15
water a   bit too.  

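Another problem arises as the numbers get larger: a word number can be wider than a short word, in which case the word needs extra spaces after it. I think this is easy to handle while pasting, by finding the number of characters (digits) in the largest number and padding any word shorter than that with the missing spaces.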
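The padding idea can be sketched in base R as follows. This is only an illustration under assumptions: the words and numbers are made up, `pad_right` is a hypothetical helper name, and `strrep()` requires R 3.3.0 or later. It pads each word/number pair to the wider of the two rather than to one global width.

```r
# Hypothetical words with large word numbers (not from the question's data)
words <- c("a", "bit", "too.")
nums  <- as.character(c(1013, 1014, 1015))

# Width of each column: the longer of the word and its number
w <- pmax(nchar(words), nchar(nums))

# Pad on the right so that everything stays left-aligned
pad_right <- function(x, width) paste0(x, strrep(" ", width - nchar(x)))

cat(paste(pad_right(nums, w), collapse = " "), "\n", sep = "")
cat(paste(pad_right(words, w), collapse = " "), "\n", sep = "")
```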

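My thoughts so far on how to approach this:

  1. Create a 1-column matrix from the character vector, with each row limited to a certain maximum length (strwrap may be useful here)
  2. Add extra spaces after short words as described above (nchar and gsub may be useful here)
  3. Use a word-count function to determine the values for a companion matrix, then use cumsum and seq to generate a matrix of numbers (stored as character), also with 1 column. Its rows will match the rows of the character (word) matrix.
  4. The two matrices now need to be interleaved row by row (not sure how to do this)
  5. Line up the numbers above the words (not sure how to do this either, but nchar may be useful here)
  6. I would like to keep this in base tools. I am sure Hadley's stringr would be useful, but I want to avoid that dependency.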
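The steps above could be combined roughly as below. This is a sketch rather than a definitive solution: `number_lines` is a hypothetical name, it wraps greedily with its own loop instead of strwrap, and it assumes R 3.3.0 or later for `strrep()`. The sample words are retyped from the example so that the block runs on its own.

```r
number_lines <- function(words, width = 80) {
  words <- as.character(words)
  nums  <- as.character(seq_along(words))
  # Each cell is as wide as the longer of the word and its number
  w <- pmax(nchar(words), nchar(nums))
  pad <- function(x) paste0(x, strrep(" ", w - nchar(x)))
  num_cells  <- pad(nums)
  word_cells <- pad(words)
  # Greedy wrap: start a new line when the next cell would overflow
  line_id <- integer(length(w))
  current <- 1L
  used <- 0L
  for (i in seq_along(w)) {
    need <- w[i] + (used > 0L)  # one separating space after the first cell
    if (used > 0L && used + need > width) {
      current <- current + 1L
      used <- 0L
      need <- w[i]
    }
    used <- used + need
    line_id[i] <- current
  }
  # Build one number row and one word row per output line
  num_rows  <- vapply(split(num_cells,  line_id), paste, character(1), collapse = " ")
  word_rows <- vapply(split(word_cells, line_id), paste, character(1), collapse = " ")
  # Interleave: number row, word row, blank separator
  as.vector(rbind(num_rows, word_rows, ""))
}

words <- c("The", "dog", "went", "to", "the", "zoo,", "but", "ate",
           "first.", "He", "likes", "water", "a", "bit", "too.")
out <- number_lines(words, width = 25)
writeLines(out)
```

Interleaving relies on `rbind()` stacking the rows into a matrix and `as.vector()` reading that matrix column by column, which alternates the number and word rows without an explicit loop.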

The dput of the data above:

     dat <- structure(list(person = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,                           
         1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("greg", "sally"), class = "factor"),             
             text = structure(c(10L, 5L, 14L, 11L, 9L, 15L, 4L, 2L, 6L,                               
             7L, 8L, 13L, 1L, 3L, 12L), .Label = c("a", "ate", "bit",                                 
             "but", "dog", "first.", "He", "likes", "the", "The", "to",                               
             "too.", "water", "went", "zoo,"), class = "factor"), word.num = 1:15), row.names = c(NA, 
         -15L), .Names = c("person", "text", "word.num"), class = "data.frame")  
    

I could not come up with a title that I felt future SO users would be able to search for. Please suggest edits...

3 answers:

Answer 0 (score: 3)

> datmat <- matrix(c(1:length(dat$text), as.character(dat$text) ), nrow=2, byrow=TRUE)
> datmat
     [,1]  [,2]  [,3]   [,4] [,5]  [,6]   [,7]  [,8]  [,9]     [,10] [,11]   [,12]   [,13] [,14] [,15] 
[1,] "1"   "2"   "3"    "4"  "5"   "6"    "7"   "8"   "9"      "10"  "11"    "12"    "13"  "14"  "15"  
[2,] "The" "dog" "went" "to" "the" "zoo," "but" "ate" "first." "He"  "likes" "water" "a"   "bit" "too."
> options(width=30)
> datmat
     [,1]  [,2]  [,3]   [,4]
[1,] "1"   "2"   "3"    "4" 
[2,] "The" "dog" "went" "to"
     [,5]  [,6]   [,7]  [,8] 
[1,] "5"   "6"    "7"   "8"  
[2,] "the" "zoo," "but" "ate"
     [,9]     [,10] [,11]  
[1,] "9"      "10"  "11"   
[2,] "first." "He"  "likes"
     [,12]   [,13] [,14]
[1,] "12"    "13"  "14" 
[2,] "water" "a"   "bit"
     [,15] 
[1,] "15"  
[2,] "too."

The quotes can be removed by coercing to a table object and letting print.table do the printing:

> class(datmat) <- "table"
> datmat
     [,1] [,2] [,3] [,4] [,5]
[1,] 1    2    3    4    5   
[2,] The  dog  went to   the 
     [,6] [,7] [,8] [,9]  
[1,] 6    7    8    9     
[2,] zoo, but  ate  first.
     [,10] [,11] [,12] [,13]
[1,] 10    11    12    13   
[2,] He    likes water a    
     [,14] [,15]
[1,] 14    15   
[2,] bit   too. 

You could also do something with this. It addresses the left-alignment issue that Gavin mentioned:

> gsub("\\[.*\\,.*\\]", "", capture.output( print(datmat, quote=FALSE) ) )
 [1] "     "                    
 [2] " 1    2    3    4    5   "
 [3] " The  dog  went to   the "
 [4] "       "                  
 [5] " 6    7    8    9     "   
 [6] " zoo, but  ate  first."   
 [7] "     "                    
 [8] " 10    11    12    13   " 
 [9] " He    likes water a    " 
[10] "     "                    
[11] " 14    15   "             
[12] " bit   too. " 

And one further refinement:

datlines <- gsub("\\[.*\\,.*\\]", "", capture.output( print(datmat, quote=FALSE) ) )
for( i in seq_along(datlines)){ cat(datlines[i], "\n") }
 #----------------------------------#
 1    2    3    4    5    
 The  dog  went to   the  

 6    7    8    9      
 zoo, but  ate  first. 

 10    11    12    13    
 He    likes water a     

 14    15    
 bit   too. 
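
As a possible variation on the loop, writeLines() prints one element per line without the extra trailing space that cat(x, "\n") adds. The sketch below uses a short stand-in for datlines (the values are illustrative only) and assumes R 3.2.0 or later for trimws().

```r
# Stand-in for the datlines vector built above (values are illustrative)
datlines <- c("     ", " 1    2    3   ", " The  dog  went")

# Drop trailing padding only; which = "right" keeps the leading indent
cleaned <- trimws(datlines, which = "right")
writeLines(cleaned)
```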

Answer 1 (score: 3)

How about:

> tmp <- setNames(as.character(dat$text), dat$word.num)
> print(tmp, quote=FALSE)
     1      2      3      4    
 likes  water      a    bit   too.
> options(width = 80)
> print(tmp, quote=FALSE)
     1      2      3      4      5      6      7      8      9     10     11 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes 
    12     13     14     15 
 water      a    bit   too. 

You can put your own class on the object and add a print method for it:

class(tmp) <- "foo"
print.foo <- function(x, quote = FALSE, ...) {
  print(unclass(x), quote = quote, ...)
}
tmp

> tmp
     1      2      3      4      5      6      7      8      9     10     11 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes 
    12     13     14     15 
 water      a    bit   too.

One way to dump this representation to a file is via capture.output(), which has a file argument:

capture.output(tmp, file = "foo.txt")

The resulting text file contains:

$ cat foo.txt 
     1      2      3      4      5      6      7      8      9     10     11 
   The    dog   went     to    the   zoo,    but    ate first.     He  likes 
    12     13     14     15 
 water      a    bit   too.

This is not quite what you have (the word numbers are right-aligned), but it is close.
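
If left alignment matters, one possible workaround is to pad both the names and the values to a common width before printing, so that right-justification no longer has anything to shift. This is a sketch under that assumption, using a shortened word vector typed in for illustration; format() left-justifies character vectors by default.

```r
words <- c("The", "dog", "went", "to", "the", "zoo,")
nums  <- as.character(seq_along(words))

# One common width for every name and value
w <- max(nchar(words), nchar(nums))

# format() pads character vectors on the right (justify = "left" is the default)
tmp <- setNames(format(words, width = w), format(nums, width = w))
print(tmp, quote = FALSE)
```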

Answer 2 (score: 1)

To complete the thread, I used DWin's solution together with some of Gavin's approach (wrapped in a function):

numbtext <- function(text.var, width=80, txt.file = NULL) {
    # Row 1 holds the word numbers, row 2 the words
    zz <- matrix(c(seq_along(text.var), as.character(text.var)), 
        nrow=2, byrow=TRUE)
    # Set the print width and restore the old value on exit, even on error
    OW <- options(width=width)
    on.exit(options(OW))
    # Blank dimnames suppress the [1,] and [,1] labels
    dimnames(zz) <- list(rep("", nrow(zz)), rep("", ncol(zz)))
    print(zz, quote = FALSE)
    if (!is.null(txt.file)){
        # Append the same printout to the file
        sink(file=txt.file, append = TRUE) 
        print(zz, quote = FALSE)
        sink()
    }
}

numbtext(dat$text, 40, "foo.txt")

Which yields the following:

 1   2   3    4  5   6    7   8  
 The dog went to the zoo, but ate

 9      10 11    12    13 14  15  
 first. He likes water a  bit too.