Data format in R

Time: 2014-05-06 16:45:42

Tags: r ggplot2 time-series

My data is in the following format:

Wed Nov 13 21:32:22 GMT 2013
Unique  1011266
back    471693  46.6438%
edge    82093   8.1178%
Thu Nov 14 13:17:02 GMT 2013
Unique  1030845
back    479623  46.5271%
edge    91870   8.9121%
Fri Nov 15 13:17:01 GMT 2013
Unique  1012254
back    455858  45.0339%
edge    69738   6.8893%
Sat Nov 16 13:17:01 GMT 2013
Unique  1030938
back    473239  45.9037%
edge    107645  10.4414%
Sun Nov 17 13:17:01 GMT 2013
Unique  1012122
back    486244  48.0420%
edge    131616  13.0039%
Mon Nov 18 13:17:01 GMT 2013
Unique  1090236
back    489005  44.8531%
edge    118735  10.8907%
Tue Nov 19 13:17:01 GMT 2013
Unique  1054120
back    477180  45.2680%
edge    89535   8.4938%

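I would like to plot this as a time series with ggplot, i.e. date vs. edge and date vs. back. The back and edge lines each hold a count followed by a percentage, but I have not been able to get the data into columns, so I cannot build a data frame from it. Any help would be great.

The desired output is: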

Date       unique  back   edge
2013-11-13 1011266 471693 82093
2013-11-14 1030845 479623 91870

2 answers:

Answer 0: (score: 2)

You want to use read.fwf here:

dat <- read.fwf(file='file.txt', 
         widths=list(28,c(6,-2,7),c(4,-4,6,-2,8),c(4,-4,6,-2,8)))

Basically, all you need to do is specify the widths argument. When several lines make up one case, widths is a list in which each element gives the field widths for one line of the record. Each record here has four lines, so the list holds four vectors. Negative numbers skip characters, here the spaces between fields. The edge count gets six characters because some values (107645, for example) have six digits; a width of 5 would silently truncate them.

The result looks like this:

> dat
                            V1     V2      V3   V4     V5       V6   V7     V8       V9
1 Wed Nov 13 21:32:22 GMT 2013 Unique 1011266 back 471693 46.6438% edge  82093  8.1178%
2 Thu Nov 14 13:17:02 GMT 2013 Unique 1030845 back 479623 46.5271% edge  91870  8.9121%
3 Fri Nov 15 13:17:01 GMT 2013 Unique 1012254 back 455858 45.0339% edge  69738  6.8893%
4 Sat Nov 16 13:17:01 GMT 2013 Unique 1030938 back 473239 45.9037% edge 107645 10.4414%
5 Sun Nov 17 13:17:01 GMT 2013 Unique 1012122 back 486244 48.0420% edge 131616 13.0039%
6 Mon Nov 18 13:17:01 GMT 2013 Unique 1090236 back 489005 44.8531% edge 118735 10.8907%
7 Tue Nov 19 13:17:01 GMT 2013 Unique 1054120 back 477180 45.2680% edge  89535  8.4938%
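
As a minimal sketch of how a multi-line widths list behaves, the snippet below runs read.fwf on two records copied from the question. It assumes the file is aligned exactly as displayed there (an 8-character label field, counts starting at column 9, percentages at column 17); the temporary file and the names sample_lines and demo are made up for illustration. The edge count is read with width 6 so that six-digit values such as 107645 stay whole.

```r
# Two records copied from the question; each record spans four lines.
sample_lines <- c(
  "Wed Nov 13 21:32:22 GMT 2013",
  "Unique  1011266",
  "back    471693  46.6438%",
  "edge    82093   8.1178%",
  "Sat Nov 16 13:17:01 GMT 2013",
  "Unique  1030938",
  "back    473239  45.9037%",
  "edge    107645  10.4414%"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# One widths vector per line of a record; negative widths skip characters,
# so c(-8, 7) drops the "Unique  " label and keeps the 7-digit count.
demo <- read.fwf(
  tmp,
  widths = list(28, c(-8, 7), c(-8, 6, -2, 8), c(-8, 6, -2, 8)),
  col.names = c("Date", "Unique", "back", "backpercent", "edge", "edgepercent"),
  stringsAsFactors = FALSE
)
str(demo)
```

If the real file is tab-separated or not aligned this way, the widths would need adjusting, and a split on whitespace is probably the safer route.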

You will probably want to drop the label columns afterwards and assign names:

setNames(dat[,c(1,3,5,6,8,9)], 
         c('Date','Unique','back','backpercent','edge','edgepercent'))

Alternatively, you can specify different widths from the start so that the variable labels (Unique, back, edge) are skipped:

dat <- read.fwf(file='file.txt', 
         widths=list(28,c(-8,7),c(-8,6,-2,8),c(-8,6,-2,8)),
         col.names=c('Date','Unique','back','backpercent','edge','edgepercent'))
dat
                          Date  Unique   back backpercent   edge edgepercent
1 Wed Nov 13 21:32:22 GMT 2013 1011266 471693    46.6438%  82093     8.1178%
2 Thu Nov 14 13:17:02 GMT 2013 1030845 479623    46.5271%  91870     8.9121%
3 Fri Nov 15 13:17:01 GMT 2013 1012254 455858    45.0339%  69738     6.8893%
4 Sat Nov 16 13:17:01 GMT 2013 1030938 473239    45.9037% 107645    10.4414%
5 Sun Nov 17 13:17:01 GMT 2013 1012122 486244    48.0420% 131616    13.0039%
6 Mon Nov 18 13:17:01 GMT 2013 1090236 489005    44.8531% 118735    10.8907%
7 Tue Nov 19 13:17:01 GMT 2013 1054120 477180    45.2680%  89535     8.4938%

You can then easily convert the Date column to POSIXct and do whatever you like with it:

as.POSIXct(as.character(dat$Date), format='%a %b %d %H:%M:%S GMT %Y', tz='GMT')
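
To get from here to the layout requested in the question, one possible follow-up is sketched below. It uses a small hand-built stand-in for dat (two rows, same column names as above), keeps only the calendar date, and turns the percentage strings into numbers. Note that %a and %b are parsed according to the current locale, so this assumes an English or C locale.

```r
# Stand-in for the `dat` produced by read.fwf (two rows only).
dat <- data.frame(
  Date = c("Wed Nov 13 21:32:22 GMT 2013", "Thu Nov 14 13:17:02 GMT 2013"),
  Unique = c(1011266, 1030845),
  back = c(471693, 479623),
  backpercent = c("46.6438%", "46.5271%"),
  edge = c(82093, 91870),
  edgepercent = c("8.1178%", "8.9121%"),
  stringsAsFactors = FALSE
)

# Parse the timestamp as GMT, then keep only the calendar date.
stamp <- as.POSIXct(dat$Date, format = "%a %b %d %H:%M:%S GMT %Y", tz = "GMT")
dat$Date <- as.Date(stamp, tz = "GMT")

# Drop the trailing "%" so the percentages become numeric.
dat$backpercent <- as.numeric(sub("%", "", dat$backpercent, fixed = TRUE))
dat$edgepercent <- as.numeric(sub("%", "", dat$edgepercent, fixed = TRUE))

# Columns in the order asked for in the question.
dat[, c("Date", "Unique", "back", "edge")]
```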

Answer 1: (score: 1)

I don't know how your data is stored, but let's say it is some kind of text file:

cat('Wed Nov 13 21:32:22 GMT 2013
Unique  1011266
back    471693  46.6438%
edge    82093   8.1178%
Thu Nov 14 13:17:02 GMT 2013
Unique  1030845
back    479623  46.5271%
edge    91870   8.9121%
Fri Nov 15 13:17:01 GMT 2013
Unique  1012254
back    455858  45.0339%
edge    69738   6.8893%
Sat Nov 16 13:17:01 GMT 2013
Unique  1030938
back    473239  45.9037%
edge    107645  10.4414%
Sun Nov 17 13:17:01 GMT 2013
Unique  1012122
back    486244  48.0420%
edge    131616  13.0039%
Mon Nov 18 13:17:01 GMT 2013
Unique  1090236
back    489005  44.8531%
edge    118735  10.8907%
Tue Nov 19 13:17:01 GMT 2013
Unique  1054120
back    477180  45.2680%
edge    89535   8.4938%\n', file='temp.txt')

raw <- readLines('temp.txt')

# split each matching line on whitespace and take the count (second field) as a number
unique <- as.numeric(sapply(strsplit(grep('^Unique',raw,value=TRUE),'[[:space:]]+'), `[`, 2))
back <- as.numeric(sapply(strsplit(grep('^back',raw,value=TRUE),'[[:space:]]+'), `[`, 2))
edge <- as.numeric(sapply(strsplit(grep('^edge',raw,value=TRUE),'[[:space:]]+'), `[`, 2))
# parse the timestamp lines directly as GMT, so no origin argument is needed
dates <- as.POSIXct(strptime(grep('GMT',raw,value=TRUE),
                   '%a %b %d %H:%M:%S GMT %Y', tz='GMT'))

# now make a data frame
dat <- data.frame(unique,back,edge,dates)

dat
#    unique   back   edge               dates
# 1 1011266 471693  82093 2013-11-13 21:32:22
# 2 1030845 479623  91870 2013-11-14 13:17:02
# 3 1012254 455858  69738 2013-11-15 13:17:01
# 4 1030938 473239 107645 2013-11-16 13:17:01
# 5 1012122 486244 131616 2013-11-17 13:17:01
# 6 1090236 489005 118735 2013-11-18 13:17:01
# 7 1054120 477180  89535 2013-11-19 13:17:01

# now plot (requires the ggplot2 package)
library(ggplot2)
ggplot(dat,aes(x=dates,y=edge)) + geom_point() + scale_x_datetime() + theme_bw()
ggplot(dat,aes(x=dates,y=back)) + geom_point() + scale_x_datetime() + theme_bw()
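
If both series should share one chart instead of two, a common ggplot2 pattern is to stack the data into long format and map the series name to colour. The sketch below is an illustration rather than part of the original answer: it rebuilds the first three days by hand as a stand-in for dat, and the names long, series and value are made up. The plot is only drawn when ggplot2 is installed.

```r
# Stand-in for the parsed data frame (first three days from the question).
dat <- data.frame(
  dates = as.POSIXct(c("2013-11-13 21:32:22", "2013-11-14 13:17:02",
                       "2013-11-15 13:17:01"), tz = "GMT"),
  back = c(471693, 479623, 455858),
  edge = c(82093, 91870, 69738)
)

# Stack back and edge into one value column with a series label.
long <- data.frame(
  dates  = rep(dat$dates, times = 2),
  series = rep(c("back", "edge"), each = nrow(dat)),
  value  = c(dat$back, dat$edge)
)

# Draw one line per series, coloured by series name.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = dates, y = value, colour = series)) +
    geom_line() + geom_point() + theme_bw()
  print(p)
}
```

Because back is several times larger than edge, a shared y axis flattens the edge line; adding facet_wrap(~ series, scales = "free_y") is one way to give each series its own scale.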