如何编写独立的2样本t检验(x,y)

时间:2015-12-08 02:53:19

标签: r

我需要为x和y编写一个算法并执行两个样本 t -test(不使用内置函数)

x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)

双样本 t - 测试适用于两个样本x和y。该函数的参数应包括:x,y,假设的均值差delta0, 以及表示左尾,右尾或双尾测试的选项。

我如何使用R? 我需要一个代码,而不仅仅是内置功能。<​​/ p>

到目前为止我已经这样做了,但是我需要t_test函数来返回样本的样本大小和样本均值,它的自由度和p值

x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)
nx = length(x)
ny = length(y)
sp = sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
s1 = sp*sqrt(1/nx+1/ny)
mu0 = 0
t = (sample.mean(x)-sample.mean(y)-mu0) / s1
t

这是2个数据集。文件“expr_data&#34;包含17个乳腺癌患者的微阵列基因表达数据,每个患者具有以“GSM”开头的ID字符串。该数据文件中的每一行代表一个基因(探针组)。这17名患者属于三个不同的治疗组:对照组,治疗组1和治疗组2,其组成员资格在&#34; group_data&#34;

中指定

数据集1

> head(expr_data)
          GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949
1007_s_at 11.376519 11.826743 11.123022 11.743439 12.172961 11.522009
1053_at    7.270398  7.534450  7.169297  7.730833  6.728914  7.033900
117_at     8.172823  8.350568  8.216073  8.052177  7.940714  8.122496
121_at    10.064195 11.193688  9.846189 10.549992 10.172722 10.357284
1255_g_at  6.256425  6.830607  5.825010  6.098157  6.104971  5.818458
1294_at    9.347887  9.540260  9.229501  9.464348  9.764903  9.962180
          GSM119950 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955
1007_s_at 11.288407 11.364544 11.783231 12.102697 12.141934 12.141672
1053_at    8.152550  7.357942  7.811469  6.704366  7.723678  7.607720
117_at     8.246269  7.597745  8.809971  7.299070  7.808597  8.390707
121_at    11.361081 10.446139 11.165541 10.285435 10.123556 10.532735
1255_g_at  6.355995  6.311312  7.366574  5.577412  4.570794  5.046956
1294_at    9.300450  9.230649  9.783263  8.749285  9.466965  9.653450
          GSM119956 GSM119957 GSM119958 GSM119959 GSM119960 GeneSymbol
1007_s_at 11.541161 12.069206 11.529456  9.692066 11.242988       DDR1
1053_at    6.904579  6.837490  7.437899  7.608960  6.704648       RFC2
117_at     7.653514  8.680945  8.050873  9.242006  8.253535      HSPA6
121_at    10.379335 10.487541 10.542419 10.248043 10.207259       PAX8
1255_g_at  6.561945  5.897955  5.402725  5.957542  6.201037     GUCA1A
1294_at    9.076623  9.827835  9.096732  9.441370  9.102000       UBA7
          PublicID
1007_s_at   U48705
1053_at     M87338
117_at      X51757
121_at      X69699
1255_g_at   L36861
1294_at     L13852

数据集2

> groups_data
   PatientID TreatmentGroup
1  GSM119946        Control
2  GSM119948        Control
3  GSM119951        Control
4  GSM119955        Control
5  GSM119956        Control
6  GSM119959        Control
7  GSM119947     Treatment1
8  GSM119950     Treatment1
9  GSM119952     Treatment1
10 GSM119953     Treatment1
11 GSM119957     Treatment1
12 GSM119958     Treatment1
13 GSM119944     Treatment2
14 GSM119945     Treatment2
15 GSM119949     Treatment2
16 GSM119954     Treatment2
17 GSM119960     Treatment2

使用我正在编写的双样本t检验函数,我需要单独测试所有基因(比较对照患者组和治疗1患者组),并假设mu_control = mu_treat1和mu_control&lt的替代假设; mu_treat1。

如果有帮助

,这里合并了两个数据集

头(groups_expr)   GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950 1 11.376519 11.826743 11.123022 11.743439 12.172961 11.522009 11.288407 2 7.270398 7.534450 7.169297 7.730833 6.728914 7.033900 8.152550 3 8.172823 8.350568 8.216073 8.052177 7.940714 8.122496 8.246269 4 10.064195 11.193688 9.846189 10.549992 10.172722 10.357284 11.361081 5 6.256425 6.830607 5.825010 6.098157 6.104971 5.818458 6.355995 6 9.347887 9.540260 9.229501 9.464348 9.764903 9.962180 9.300450   GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957 1 11.364544 11.783231 12.102697 12.141934 12.141672 11.541161 12.069206 2 7.357942 7.811469 6.704366 7.723678 7.607720 6.904579 6.837490 3 7.597745 8.809971 7.299070 7.808597 8.390707 7.653514 8.680945 4 10.446139 11.165541 10.285435 10.123556 10.532735 10.379335 10.487541 5 6.311312 7.366574 5.577412 4.570794 5.046956 6.561945 5.897955 6 9.230649 9.783263 8.749285 9.466965 9.653450 9.076623 9.827835   GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID TreatmentGroup 1 11.529456 9.692066 11.242988 DDR1 U48705 GSM119946控制 2 7.437899 7.608960 6.704648 RFC2 M87338 GSM119946控制 3 8.050873 9.242006 8.253535 HSPA6 X51757 GSM119946控制 4 10.542419 10.248043 10.207259 PAX8 X69699 GSM119946控制 5 5.402725 5.957542 6.201037 GUCA1A L36861 GSM119946控制 6 9.096732 9.441370 9.102000 UBA7 L13852 GSM119946控制

尾(groups_expr)        GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950 378806 4.671951 4.731546 3.364612 2.893266 2.450373 4.6563807 4.375824 378807 2.954090 4.653969 2.695438 3.193373 3.685037 3.9202165 5.387476 378808 3.159816 5.216588 3.989162 5.387770 5.579206 5.9640708 4.796789 378809 1.464918 1.892150 1.398225 1.780359 1.477039 0.8966322 5.217179 378810 3.567588 3.642495 5.003216 3.565525 4.190032 3.2998454 4.903368 378811 2.959766 3.164650 1.462571 2.681616 2.646549 3.3482051 3.317340        GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957 378806 3.501316 5.121043 2.957501 3.072479 3.395843 3.183937 3.332907 378807 4.008853 3.808073 3.356645 3.979238 3.327875 3.143567 3.500472 378808 2.468878 4.937979 3.568130 3.105428 5.978494 3.431517 5.485591 378809 4.893662 2.465712 1.967586 1.632630 1.051223 2.272937 1.399148 378810 5.079019 3.653048 2.997752 4.118145 4.460848 5.101762 3.812710 378811 1.259031 2.661944 2.537223 2.692363 2.333142 1.011025 2.732608        GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID 378806 4.230085 3.740862 2.963901 AFFX-ThrX-3 GSM119960 378807 3.405755 3.703066 3.421292 AFFX-ThrX-5 GSM119960 378808 4.333555 5.543589 3.771600 AFFX-ThrX-M GSM119960 378809 4.217012 2.025573 2.080592 AFFX-TrpnX-3 GSM119960 378810 5.254337 3.054821 4.731657 AFFX-TrpnX-5 GSM119960 378811 2.621702 1.619972 2.243780 AFFX-TrpnX-M GSM119960        TreatmentGroup 378806治疗2 378807治疗2 378808治疗2 378809治疗2 378810治疗2 378811治疗2

有378811行,我需要对所有这些行(基因)进行t检验,以比较Treatment1和Control患者(GSM ******是患者ID)。

1 个答案:

答案 0 :(得分:0)

请参阅here以获取公式参考。

x <- c(2,4,6,8,9,10,12,14)
y <- c(3,5,7,9,12,13,15,18)

tt <- function(x,y,mu0=0,ts=TRUE) { # two-sided t-test
  nx <- length(x)
  ny <- length(y)
  sp <- sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
  t <- (mean(x)-mean(y)-mu0) / (sp*sqrt(1/nx+1/ny))
  df <- (var(x)/nx + var(y)/ny)^2 / 
    ((var(x)/nx)^2/(nx-1) + (var(y)/ny)^2/(ny-1))

  sample_sizes <- c(nx, ny)
  names(sample_sizes) <- c("x","y")
  sample_means <- c(mean(x), mean(y))
  names(sample_means) <- c("x", "y")
  pvalue <- ifelse(ts,2,1)*(1-pt(abs(t),df=df))

  list(sample_sizes=sample_sizes, sample_means=sample_means,
       df=df, pvalue=pvalue)
}

tt(x,y)

# $sample_sizes
# x y 
# 8 8 
#
# $sample_means
#     x      y 
# 8.125 10.250 
#
# $df
# [1] 13.21697
#
# $pvalue
# [1] 0.3737564

修改

以下是您可以将此功能用于示例的方法。这是第一个数据集。

expr_data <- data.frame(matrix(
c("1007_s_at", "11.376519", "11.826743", "11.123022", "11.743439", "12.172961", 
"11.522009", "11.288407", "11.364544", "11.783231", "12.102697", "12.141934", 
"12.141672", "11.541161", "12.069206", "11.529456", "9.692066", "11.242988", 
"DDR1", "U48705", "1053_at", "7.270398", "7.534450", "7.169297", "7.730833", 
"6.728914", "7.033900", "8.152550", "7.357942", "7.811469", "6.704366",  
"7.723678", "7.607720", "6.904579", "6.837490", "7.437899", "7.608960", 
"6.704648", "RFC2", "M87338", "117_at", "8.172823", "8.350568", "8.216073", 
"8.052177", "7.940714", "8.122496", "8.246269", "7.597745", "8.809971", 
"7.299070", "7.808597", "8.390707", "7.653514", "8.680945", "8.050873", 
"9.242006", "8.253535", "HSPA6", "X51757", "121_at", "10.064195", 
"11.193688", "9.846189", "10.549992", "10.172722", "10.357284", "11.361081", 
"10.446139", "11.165541", "10.285435", "10.123556", "10.532735", 
"10.379335", "10.487541", "10.542419", "10.248043", "10.207259", "PAX8", 
"X69699", "1255_g_at", "6.256425", "6.830607", "5.825010", "6.098157", 
"6.104971", "5.818458", "6.355995", "6.311312", "7.366574", "5.577412", 
"4.570794", "5.046956", "6.561945", "5.897955", "5.402725", "5.957542", 
"6.201037", "GUCA1A", "L36861", "1294_at", "9.347887", "9.540260", 
"9.229501", "9.464348", "9.764903", "9.962180", "9.300450", "9.230649", 
"9.783263","8.749285", "9.466965", "9.653450", "9.076623", "9.827835", 
"9.096732", "9.441370", "9.102000", "UBA7", "L13852"), nrow=6, byrow=TRUE))

row.names(expr_data) <- expr_data[,1]
expr_data <- expr_data[,-1]
names(expr_data) <- c("GSM119944", "GSM119945", "GSM119946", "GSM119947", 
                      "GSM119948", "GSM119949", "GSM119950", "GSM119951", 
                      "GSM119952", "GSM119953", "GSM119954", "GSM119955", 
                      "GSM119956", "GSM119957", "GSM119958", "GSM119959", 
                      "GSM119960", "GeneSymbol", "PublicID")
expr_data[,1:17] <- sapply(expr_data[,1:17], function(x)
  as.numeric(as.character(x)))    

第二个数据集。

groups_data <- data.frame(
  PatientID=c('GSM119946','GSM119948','GSM119951','GSM119955','GSM119956',
              'GSM119959','GSM119947','GSM119950','GSM119952','GSM119953',
              'GSM119957','GSM119958','GSM119944','GSM119945','GSM119949',
              'GSM119954','GSM119960'),
  TreatmentGroup = c(rep('Control',6), rep('Treatment1',6), 
                     rep('Treatment2',5))
)

然后,进行适当的测试。

control_index <- which(groups_data$TreatmentGroup=="Control")
treatment_index <- which(groups_data$TreatmentGroup=="Treatment1")
# assume length(control_index) = length(treatment_index)
for(i in 1:length(control_index)) {
  control_group <- expr_data[,groups_data$PatientID[control_index[i]]]
  treatment_group <- expr_data[,groups_data$PatientID[treatment_index[i]]]
  cat("T-test for", as.character(groups_data$PatientID[control_index[i]]), "and", 
      as.character(groups_data$PatientID[treatment_index[i]]), "\n")
  result <- tt(control_group, treatment_group, 0, FALSE)
  cat("  sample sizes:", as.numeric(result$sample_sizes),"\n")
  cat("  sample means:", as.numeric(result$sample_means),"\n")
  cat("  degrees of freedom:", as.numeric(result$df),"\n")
  cat("  p-value:", as.numeric(result$pvalue),"\n\n")
}

<强>输出

T-test for GSM119946 and GSM119947 
sample sizes: 6 6 
sample means: 8.568182 8.939824 
degrees of freedom: 9.947607 
p-value: 0.3759984 

T-test for GSM119948 and GSM119950 
sample sizes: 6 6 
sample means: 8.814198 9.117459 
degrees of freedom: 9.744115 
p-value: 0.4053794 

T-test for GSM119951 and GSM119952 
sample sizes: 6 6 
sample means: 8.718055 9.453342 
degrees of freedom: 9.916609 
p-value: 0.2560534 

T-test for GSM119955 and GSM119953 
sample sizes: 6 6 
sample means: 8.89554 8.453044 
degrees of freedom: 9.996676 
p-value: 0.38034 

T-test for GSM119956 and GSM119957 
sample sizes: 6 6 
sample means: 8.686193 8.966829 
degrees of freedom: 9.792449 
p-value: 0.4132705 

T-test for GSM119959 and GSM119958 
sample sizes: 6 6 
sample means: 8.698331 8.676684 
degrees of freedom: 9.134459 
p-value: 0.492472 

编辑2

如果要测试行,可以使用以下代码。

control_cols <- groups_data$PatientID[which(groups_data$TreatmentGroup=="Control")]
treatment_cols <- groups_data$PatientID[which(groups_data$TreatmentGroup=="Treatment1")]

nrows <- dim(expr_data)[1]
for(i in 1:nrows) {
  control_group <- expr_data[i,control_cols]
  treatment_group <- expr_data[i,treatment_cols]

  cat("T-test for control vs treatment (", row.names(treatment_group), ")\n")
  result <- tt(as.numeric(control_group), as.numeric(treatment_group), 0, FALSE)
  cat("  sample sizes:", as.numeric(result$sample_sizes),"\n")
  cat("  sample means:", as.numeric(result$sample_means),"\n")
  cat("  degrees of freedom:", as.numeric(result$df),"\n")
  cat("  p-value:", as.numeric(result$pvalue),"\n\n")
}