How to look up by row and column names?

Time: 2017-04-29 10:43:12

Tags: r dplyr

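I am thinking about how to look up time data by university name (first row: A, ..., F), field name (first column: Acute, ..., En) and/or graduation time (time). I am considering a dplyr approach, but I could not extend a numeric ID lookup (from an answer in the thread How to overload function parameters in R?) to a lookup by three variables. Challenges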

  1. How do I look up by the first row? Maybe something like $1 == "A".
  2. How do I extend the university lookup to two columns? In pseudocode, $1 == "A" refers to the second and third columns, ..., and $1 == "F" to the last two columns.
  3. Lookup by three criteria: the first row (which has no header), the first column with the header Field, and the header time. Pseudocode

    times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")
  4. The file DS.csv contains the data. The first row denotes the experiment. The data is in crosstab format

    ,A,,B,,C,,D,,E,,F,
    Field,time,T,time,T,time,T,time,T,time,T,time,T
    Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
    Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
    En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
    Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3

    and in straight-table format

    Field,time,T,Experiment
    Acute,0,0,A
    An,9,120,A
    En,15.6,2,A
    Fo,9.2,2,A
    Acute,8.3,1,B
    An,7.7,26,B
    En,12.9,1,B
    Fo,0,0,B
    Acute,7.5,1,C
    An,7.9,43,C
    En,0,0,C
    Fo,5.4,1,C
    Acute,8.6,2,D
    An,7.8,77,D
    En,0,0,D
    Fo,0,0,D
    Acute,0,0,E
    An,7.9,60,E
    En,14.3,1,E
    Fo,0,0,E
    Acute,8.3,4,F
    An,8.2,326,F
    En,14.6,4,F
    Fo,7.9,3,F

    Pseudocode

    library('dplyr')
    ow <- options("warn")
    DF <- read.csv("/home/masi/CSV/DS.csv", header = T)
    # Lookup by first row, lookup by Field, lookup by Field's first column?
    times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")

    Expected output: 9

    Expected output, generalised: a, b, c, ...

    R: 3.3.3 (2017-03-06)
    OS: Debian 8.7
    Hardware: Asus Zenbook UX303UA

3 answers:

Answer 0: (score: 3)

Taking your initial raw data as the starting point:

# read the data & skip 1st & 2nd line which contain only header information
DF <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, skip=2)

# read the first two lines which contain the header information
headers <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, nrows=2)

# extract the university names from the 'headers' data.frame
universities <- unlist(headers[1,])
universities <- universities[universities != '']

# create column names from the 'headers' data.frame
vec <- headers[2,][headers[2,] == 'T']
headers[2,][headers[2,] == 'T'] <- paste0(vec, seq_along(vec))
names(DF) <- paste0(headers[2,],headers[1,])

Your data frame now looks as follows:

> DF
   Field timeA  T1 timeB T2 timeC T3 timeD T4 timeE T5 timeF  T6
1: Acute   0.0   0   8.3  1   7.5  1   8.6  2   0.0  0   8.3   4
2:   Ane   9.0 120   7.7 26   7.9 43   7.8 77   7.9 60   8.2 326
3:    En  15.6   2  12.9  1   0.0  0   0.0  0  14.3  1  14.6   4
4:    Fo   9.2   2   0.0  0   5.4  1   0.0  0   0.0  0   7.9   3
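
If you need a lookup straight from this wide frame, before any reshaping, it could be sketched as below. This is only an illustration: `wide` is a hand-typed two-university excerpt of the frame above, and `lookup_wide` is a hypothetical helper, not part of any package.

```r
# Hand-typed excerpt of the wide frame (universities A and B only)
wide <- data.frame(
  Field = c("Acute", "Ane", "En", "Fo"),
  timeA = c(0, 9, 15.6, 9.2), T1 = c(0, 120, 2, 2),
  timeB = c(8.3, 7.7, 12.9, 0), T2 = c(1, 26, 1, 0),
  stringsAsFactors = FALSE
)
universities <- c("A", "B")

# Hypothetical helper: pick the (time, T) column pair of one university
# and the row of one field.
lookup_wide <- function(data, university, field) {
  i <- match(university, universities)           # position gives the T index
  cols <- c(paste0("time", university), paste0("T", i))
  data[data$Field == field, cols]
}

lookup_wide(wide, "A", "Ane")  # timeA = 9, T1 = 120
```

The column names have to be rebuilt by hand for every lookup, which is the main reason the long format is more convenient.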

It is better to convert the data to long format:

library(data.table)
DT <- melt(setDT(DF), id = 1, 
           measure.vars = patterns('^time','^T'),
           variable.name = 'university', 
           value.name = c('time','t')
           )[, university := universities[university]][]

Now your data looks as follows:

> DT
    Field university time   t
 1: Acute          A  0.0   0
 2:   Ane          A  9.0 120
 3:    En          A 15.6   2
 4:    Fo          A  9.2   2
 5: Acute          B  8.3   1
 6:   Ane          B  7.7  26
 7:    En          B 12.9   1
 8:    Fo          B  0.0   0
 9: Acute          C  7.5   1
10:   Ane          C  7.9  43
11:    En          C  0.0   0
12:    Fo          C  5.4   1
13: Acute          D  8.6   2
14:   Ane          D  7.8  77
15:    En          D  0.0   0
16:    Fo          D  0.0   0
17: Acute          E  0.0   0
18:   Ane          E  7.9  60
19:    En          E 14.3   1
20:    Fo          E  0.0   0
21: Acute          F  8.3   4
22:   Ane          F  8.2 326
23:    En          F 14.6   4
24:    Fo          F  7.9   3

Now you can select the information you need:

 DT[university == 'A' & Field == 'Ane']

which gives:

   Field university time   t
1:   Ane          A    9 120

A few dplyr examples of filtering the data:

library(dplyr)
DT %>% 
  filter(Field=="En" & t > 1)

which gives:

  Field university time t
1    En          A 15.6 2
2    En          F 14.6 4

Or:

DT %>%
  arrange(desc(time)) %>%
  filter(time < 14 & t > 3)

which gives:

  Field university time   t
1   Ane          A  9.0 120
2 Acute          F  8.3   4
3   Ane          F  8.2 326
4   Ane          C  7.9  43
5   Ane          E  7.9  60
6   Ane          D  7.8  77
7   Ane          B  7.7  26

Answer 1: (score: 2)

Change the crosstab

,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3

into the straight data format

Field,time,T,Experiment
Acute,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B
Acute,7.5,1,C
An,7.9,43,C
En,0,0,C
Fo,5.4,1,C
Acute,8.6,2,D
An,7.8,77,D
En,0,0,D
Fo,0,0,D
Acute,0,0,E
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E
Acute,8.3,4,F
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F

I did the conversion with the Vim.csv plugin and visual block mode.
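
If you would rather script the conversion than edit by hand, a base R sketch could look like the following. The variable names (`csv_text`, `experiments`, `body`, `straight`) are made up for illustration, and the crosstab is pasted in as text instead of being read from a file.

```r
# Sketch: crosstab -> straight table in base R (no packages needed)
csv_text <- ",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3"

# Experiment labels sit in the first line; every second cell is empty.
first_line <- strsplit(csv_text, "\n")[[1]][1]
cells <- strsplit(first_line, ",")[[1]]
experiments <- cells[cells != ""]

# The data rows start on the third line.
body <- read.csv(text = csv_text, header = FALSE, skip = 2,
                 stringsAsFactors = FALSE)

# Stack one (time, T) column pair per experiment.
straight <- do.call(rbind, lapply(seq_along(experiments), function(i) {
  data.frame(Field = body[[1]],
             time = body[[2 * i]],
             T = body[[2 * i + 1]],
             Experiment = experiments[i],
             stringsAsFactors = FALSE)
}))

straight[straight$Experiment == "A" & straight$Field == "Ane", ]
```

Note that the field labels come out exactly as they are spelled in the crosstab (for example Ane), so a query against this result has to use those spellings. The result can be saved with write.csv(straight, file, row.names = FALSE) if a CSV file is needed.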

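Multiple ways to do the selection

Once the data is tidied into a straight table (not a crosstab), the selection is easy to do in many ways; I prefer SQL. Below I demonstrate the sqldf package, which is very inefficient for big data, but this data set is small, so it works.

Instead of the rather inefficient built-in functions such as read.csv, I use the very efficient fread from the data.table package to read the file.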

SQLDF

> library(data.table)
> library(sqldf)
> a <- fread("~/DS_straight_table.csv")
> sqldf("select time from a where Experiment='A' and Field='An'")
  time
1    9

Alternatively, without sqldf

> library(data.table);
> a<-fread("~/DS_straight_table.csv");
> a[Experiment=='A' & Field=='An'] 
Field time   T Experiment
1:    An    9 120          A

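Answer 2: (score: 1)

Using the "tall" (straight table) format and the dplyr library. Your data has only one value for each combination of Field and Experiment.

library(dplyr)

## df is assumed to hold the straight-table data (columns Field, time, T, Experiment)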

## this is the more general result
df %>% 
  group_by(Field, Experiment) %>%
  top_n(1, wt = -time)


## example function
getTimes<- function(data, field, experiment) {
  data %>% 
    filter(Field == field, Experiment == experiment) %>%
    top_n(1, wt = -time)
}


getTimes(df, 'An', 'A')

#   Field time   T Experiment
# 1    An    9 120          A
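
If the lookup should also work with only one of the two criteria (the "and/or" in the question), a base R variant could look like this. `lookup_times` and its argument names are made up for this sketch, and `df` here is a hand-typed excerpt of the straight table, so treat it as an illustration only.

```r
# Hypothetical helper: a criterion left as NULL is simply not applied.
# The result is sorted by descending time, as in the question's pseudocode.
lookup_times <- function(data, experiment = NULL, field = NULL) {
  keep <- rep(TRUE, nrow(data))
  if (!is.null(experiment)) keep <- keep & data$Experiment == experiment
  if (!is.null(field))      keep <- keep & data$Field == field
  out <- data[keep, , drop = FALSE]
  out[order(out$time, decreasing = TRUE), , drop = FALSE]
}

# A few rows of the straight table, typed in by hand
df <- data.frame(
  Field = c("Acute", "An", "En", "Acute", "An", "En"),
  time = c(0, 9, 15.6, 8.3, 7.7, 12.9),
  T = c(0, 120, 2, 1, 26, 1),
  Experiment = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

lookup_times(df, experiment = "A", field = "An")$time  # 9
lookup_times(df, field = "En")$time                    # 15.6 12.9
```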