I need to write an algorithm that performs a two-sample t-test on x and y (without using the built-in function).
x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)
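The two-sample t-test is applied to the two samples x and y. The function's arguments should include: x, y, the hypothesized difference in means delta0, and an option that selects a left-tailed, right-tailed, or two-tailed test.
How can I do this in R? I need the code itself, not just a call to the built-in function.
This is what I have so far, but I need the t_test function to return the sample sizes and sample means, the degrees of freedom, and the p-value.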
x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)
nx = length(x)
ny = length(y)
sp = sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
s1 = sp*sqrt(1/nx+1/ny)
mu0 = 0
t = (mean(x)-mean(y)-mu0) / s1
t
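Here are the 2 datasets. The file "expr_data" contains microarray gene expression data for 17 breast cancer patients, each of whom has an ID string beginning with "GSM". Each row in the data file represents one gene (probe set). The 17 patients belong to three different treatment groups: control, treatment 1, and treatment 2. Group membership is specified in "group_data".
Dataset 1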
> head(expr_data)
GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949
1007_s_at 11.376519 11.826743 11.123022 11.743439 12.172961 11.522009
1053_at 7.270398 7.534450 7.169297 7.730833 6.728914 7.033900
117_at 8.172823 8.350568 8.216073 8.052177 7.940714 8.122496
121_at 10.064195 11.193688 9.846189 10.549992 10.172722 10.357284
1255_g_at 6.256425 6.830607 5.825010 6.098157 6.104971 5.818458
1294_at 9.347887 9.540260 9.229501 9.464348 9.764903 9.962180
GSM119950 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955
1007_s_at 11.288407 11.364544 11.783231 12.102697 12.141934 12.141672
1053_at 8.152550 7.357942 7.811469 6.704366 7.723678 7.607720
117_at 8.246269 7.597745 8.809971 7.299070 7.808597 8.390707
121_at 11.361081 10.446139 11.165541 10.285435 10.123556 10.532735
1255_g_at 6.355995 6.311312 7.366574 5.577412 4.570794 5.046956
1294_at 9.300450 9.230649 9.783263 8.749285 9.466965 9.653450
GSM119956 GSM119957 GSM119958 GSM119959 GSM119960 GeneSymbol
1007_s_at 11.541161 12.069206 11.529456 9.692066 11.242988 DDR1
1053_at 6.904579 6.837490 7.437899 7.608960 6.704648 RFC2
117_at 7.653514 8.680945 8.050873 9.242006 8.253535 HSPA6
121_at 10.379335 10.487541 10.542419 10.248043 10.207259 PAX8
1255_g_at 6.561945 5.897955 5.402725 5.957542 6.201037 GUCA1A
1294_at 9.076623 9.827835 9.096732 9.441370 9.102000 UBA7
PublicID
1007_s_at U48705
1053_at M87338
117_at X51757
121_at X69699
1255_g_at L36861
1294_at L13852
Dataset 2
> groups_data
PatientID TreatmentGroup
1 GSM119946 Control
2 GSM119948 Control
3 GSM119951 Control
4 GSM119955 Control
5 GSM119956 Control
6 GSM119959 Control
7 GSM119947 Treatment1
8 GSM119950 Treatment1
9 GSM119952 Treatment1
10 GSM119953 Treatment1
11 GSM119957 Treatment1
12 GSM119958 Treatment1
13 GSM119944 Treatment2
14 GSM119945 Treatment2
15 GSM119949 Treatment2
16 GSM119954 Treatment2
17 GSM119960 Treatment2
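Using the two-sample t-test function I am writing, I need to test every gene individually (comparing the control patient group with the treatment 1 patient group), with the null hypothesis mu_control = mu_treat1 and the alternative hypothesis mu_control < mu_treat1.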
If it helps, here are the two datasets merged:
> head(groups_expr)
GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950
1 11.376519 11.826743 11.123022 11.743439 12.172961 11.522009 11.288407
2 7.270398 7.534450 7.169297 7.730833 6.728914 7.033900 8.152550
3 8.172823 8.350568 8.216073 8.052177 7.940714 8.122496 8.246269
4 10.064195 11.193688 9.846189 10.549992 10.172722 10.357284 11.361081
5 6.256425 6.830607 5.825010 6.098157 6.104971 5.818458 6.355995
6 9.347887 9.540260 9.229501 9.464348 9.764903 9.962180 9.300450
GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957
1 11.364544 11.783231 12.102697 12.141934 12.141672 11.541161 12.069206
2 7.357942 7.811469 6.704366 7.723678 7.607720 6.904579 6.837490
3 7.597745 8.809971 7.299070 7.808597 8.390707 7.653514 8.680945
4 10.446139 11.165541 10.285435 10.123556 10.532735 10.379335 10.487541
5 6.311312 7.366574 5.577412 4.570794 5.046956 6.561945 5.897955
6 9.230649 9.783263 8.749285 9.466965 9.653450 9.076623 9.827835
GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID TreatmentGroup
1 11.529456 9.692066 11.242988 DDR1 U48705 GSM119946 Control
2 7.437899 7.608960 6.704648 RFC2 M87338 GSM119946 Control
3 8.050873 9.242006 8.253535 HSPA6 X51757 GSM119946 Control
4 10.542419 10.248043 10.207259 PAX8 X69699 GSM119946 Control
5 5.402725 5.957542 6.201037 GUCA1A L36861 GSM119946 Control
6 9.096732 9.441370 9.102000 UBA7 L13852 GSM119946 Control
> tail(groups_expr)
GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950
378806 4.671951 4.731546 3.364612 2.893266 2.450373 4.6563807 4.375824
378807 2.954090 4.653969 2.695438 3.193373 3.685037 3.9202165 5.387476
378808 3.159816 5.216588 3.989162 5.387770 5.579206 5.9640708 4.796789
378809 1.464918 1.892150 1.398225 1.780359 1.477039 0.8966322 5.217179
378810 3.567588 3.642495 5.003216 3.565525 4.190032 3.2998454 4.903368
378811 2.959766 3.164650 1.462571 2.681616 2.646549 3.3482051 3.317340
GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957
378806 3.501316 5.121043 2.957501 3.072479 3.395843 3.183937 3.332907
378807 4.008853 3.808073 3.356645 3.979238 3.327875 3.143567 3.500472
378808 2.468878 4.937979 3.568130 3.105428 5.978494 3.431517 5.485591
378809 4.893662 2.465712 1.967586 1.632630 1.051223 2.272937 1.399148
378810 5.079019 3.653048 2.997752 4.118145 4.460848 5.101762 3.812710
378811 1.259031 2.661944 2.537223 2.692363 2.333142 1.011025 2.732608
GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID
378806 4.230085 3.740862 2.963901 AFFX-ThrX-3 GSM119960
378807 3.405755 3.703066 3.421292 AFFX-ThrX-5 GSM119960
378808 4.333555 5.543589 3.771600 AFFX-ThrX-M GSM119960
378809 4.217012 2.025573 2.080592 AFFX-TrpnX-3 GSM119960
378810 5.254337 3.054821 4.731657 AFFX-TrpnX-5 GSM119960
378811 2.621702 1.619972 2.243780 AFFX-TrpnX-M GSM119960
TreatmentGroup
378806 Treatment2
378807 Treatment2
378808 Treatment2
378809 Treatment2
378810 Treatment2
378811 Treatment2
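There are 378811 rows, and I need to run a t-test on every one of these rows (genes) to compare the Treatment1 and Control patients (the GSM****** values are patient IDs).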
Answer 0 (score: 0)
See here for a reference on the formulas.
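For convenience, the quantities that the code below computes can be written out as follows. These are the standard textbook forms; the notation ($\bar{x}$, $s_x^2$, $n_x$, $\delta_0$, $\nu$) is my own and is not taken from the linked page.

```latex
s_p = \sqrt{\frac{(n_x - 1)\,s_x^2 + (n_y - 1)\,s_y^2}{n_x + n_y - 2}},
\qquad
t = \frac{\bar{x} - \bar{y} - \delta_0}{s_p \sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}

\nu = \frac{\left(\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}\right)^{2}}
           {\dfrac{(s_x^2/n_x)^2}{n_x - 1} + \dfrac{(s_y^2/n_y)^2}{n_y - 1}}
```

Note that the pooled statistic has exactly $n_x + n_y - 2$ degrees of freedom when the two variances are equal, while the approximation $\nu$ (Welch-Satterthwaite) is normally paired with the unpooled standard error $\sqrt{s_x^2/n_x + s_y^2/n_y}$. The two standard errors coincide when $n_x = n_y$, as in the example data.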
x <- c(2,4,6,8,9,10,12,14)
y <- c(3,5,7,9,12,13,15,18)
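tt <- function(x, y, mu0 = 0, ts = TRUE) {
  # ts = TRUE returns a two-sided p-value, ts = FALSE a one-sided one
  nx <- length(x)
  ny <- length(y)
  # pooled standard deviation
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  # t statistic for H0: mean(x) - mean(y) = mu0
  t <- (mean(x) - mean(y) - mu0) / (sp * sqrt(1 / nx + 1 / ny))
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (var(x) / nx + var(y) / ny)^2 /
    ((var(x) / nx)^2 / (nx - 1) + (var(y) / ny)^2 / (ny - 1))
  sample_sizes <- c(nx, ny)
  names(sample_sizes) <- c("x", "y")
  sample_means <- c(mean(x), mean(y))
  names(sample_means) <- c("x", "y")
  # the one-sided p-value uses |t|, so it does not depend on the sign of t
  pvalue <- ifelse(ts, 2, 1) * (1 - pt(abs(t), df = df))
  list(sample_sizes = sample_sizes, sample_means = sample_means,
       df = df, pvalue = pvalue)
}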
tt(x,y)
# $sample_sizes
# x y
# 8 8
#
# $sample_means
# x y
# 8.125 10.250
#
# $df
# [1] 13.21697
#
# $pvalue
# [1] 0.3737564
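The question also asks for a hypothesized difference delta0 and a choice between a left-tailed, right-tailed, or two-tailed test. A minimal sketch of one way to support that is shown here. The name `t_test` comes from the question, but the `alternative` argument and its values are my own choice (borrowed from the naming used by the built-in `t.test`). The sketch assumes equal variances, so it uses the pooled standard deviation with nx + ny - 2 degrees of freedom rather than the approximate degrees of freedom computed in `tt`.

```r
# Sketch only: the `alternative` argument is an illustrative design choice,
# not something specified in the question.
t_test <- function(x, y, delta0 = 0,
                   alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  nx <- length(x)
  ny <- length(y)
  mx <- mean(x)
  my <- mean(y)
  # pooled standard deviation (assumes both groups share one variance)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  se <- sp * sqrt(1 / nx + 1 / ny)
  t_stat <- (mx - my - delta0) / se
  df <- nx + ny - 2
  # "less" tests H1: mean(x) - mean(y) < delta0 (left tail),
  # "greater" tests H1: mean(x) - mean(y) > delta0 (right tail)
  p_value <- switch(alternative,
    two.sided = 2 * pt(-abs(t_stat), df),
    less      = pt(t_stat, df),
    greater   = pt(t_stat, df, lower.tail = FALSE)
  )
  list(sample_sizes = c(x = nx, y = ny),
       sample_means = c(x = mx, y = my),
       t = t_stat, df = df, p_value = p_value)
}

x <- c(2, 4, 6, 8, 9, 10, 12, 14)
y <- c(3, 5, 7, 9, 12, 13, 15, 18)
res <- t_test(x, y, alternative = "less")
res
```

Only `pt` (the t distribution function) is used from base R. The result can be checked against `t.test(x, y, var.equal = TRUE, alternative = "less")` if a sanity check is wanted.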
Edit
Here is how you can use this function with your example. This is the first dataset.
expr_data <- data.frame(matrix(
c("1007_s_at", "11.376519", "11.826743", "11.123022", "11.743439", "12.172961",
"11.522009", "11.288407", "11.364544", "11.783231", "12.102697", "12.141934",
"12.141672", "11.541161", "12.069206", "11.529456", "9.692066", "11.242988",
"DDR1", "U48705", "1053_at", "7.270398", "7.534450", "7.169297", "7.730833",
"6.728914", "7.033900", "8.152550", "7.357942", "7.811469", "6.704366",
"7.723678", "7.607720", "6.904579", "6.837490", "7.437899", "7.608960",
"6.704648", "RFC2", "M87338", "117_at", "8.172823", "8.350568", "8.216073",
"8.052177", "7.940714", "8.122496", "8.246269", "7.597745", "8.809971",
"7.299070", "7.808597", "8.390707", "7.653514", "8.680945", "8.050873",
"9.242006", "8.253535", "HSPA6", "X51757", "121_at", "10.064195",
"11.193688", "9.846189", "10.549992", "10.172722", "10.357284", "11.361081",
"10.446139", "11.165541", "10.285435", "10.123556", "10.532735",
"10.379335", "10.487541", "10.542419", "10.248043", "10.207259", "PAX8",
"X69699", "1255_g_at", "6.256425", "6.830607", "5.825010", "6.098157",
"6.104971", "5.818458", "6.355995", "6.311312", "7.366574", "5.577412",
"4.570794", "5.046956", "6.561945", "5.897955", "5.402725", "5.957542",
"6.201037", "GUCA1A", "L36861", "1294_at", "9.347887", "9.540260",
"9.229501", "9.464348", "9.764903", "9.962180", "9.300450", "9.230649",
"9.783263","8.749285", "9.466965", "9.653450", "9.076623", "9.827835",
"9.096732", "9.441370", "9.102000", "UBA7", "L13852"), nrow=6, byrow=TRUE))
row.names(expr_data) <- expr_data[,1]
expr_data <- expr_data[,-1]
names(expr_data) <- c("GSM119944", "GSM119945", "GSM119946", "GSM119947",
"GSM119948", "GSM119949", "GSM119950", "GSM119951",
"GSM119952", "GSM119953", "GSM119954", "GSM119955",
"GSM119956", "GSM119957", "GSM119958", "GSM119959",
"GSM119960", "GeneSymbol", "PublicID")
expr_data[,1:17] <- sapply(expr_data[,1:17], function(x)
as.numeric(as.character(x)))
The second dataset.
groups_data <- data.frame(
PatientID=c('GSM119946','GSM119948','GSM119951','GSM119955','GSM119956',
'GSM119959','GSM119947','GSM119950','GSM119952','GSM119953',
'GSM119957','GSM119958','GSM119944','GSM119945','GSM119949',
'GSM119954','GSM119960'),
TreatmentGroup = c(rep('Control',6), rep('Treatment1',6),
rep('Treatment2',5))
)
Then run the tests.
control_index <- which(groups_data$TreatmentGroup == "Control")
treatment_index <- which(groups_data$TreatmentGroup == "Treatment1")
# assume length(control_index) = length(treatment_index)
for (i in seq_along(control_index)) {
  # as.character() so that columns are selected by name even if PatientID is a factor
  control_id <- as.character(groups_data$PatientID[control_index[i]])
  treatment_id <- as.character(groups_data$PatientID[treatment_index[i]])
  control_group <- expr_data[, control_id]
  treatment_group <- expr_data[, treatment_id]
  cat("T-test for", control_id, "and", treatment_id, "\n")
  result <- tt(control_group, treatment_group, 0, FALSE)
  cat(" sample sizes:", as.numeric(result$sample_sizes), "\n")
  cat(" sample means:", as.numeric(result$sample_means), "\n")
  cat(" degrees of freedom:", as.numeric(result$df), "\n")
  cat(" p-value:", as.numeric(result$pvalue), "\n\n")
}
Output
T-test for GSM119946 and GSM119947
sample sizes: 6 6
sample means: 8.568182 8.939824
degrees of freedom: 9.947607
p-value: 0.3759984
T-test for GSM119948 and GSM119950
sample sizes: 6 6
sample means: 8.814198 9.117459
degrees of freedom: 9.744115
p-value: 0.4053794
T-test for GSM119951 and GSM119952
sample sizes: 6 6
sample means: 8.718055 9.453342
degrees of freedom: 9.916609
p-value: 0.2560534
T-test for GSM119955 and GSM119953
sample sizes: 6 6
sample means: 8.89554 8.453044
degrees of freedom: 9.996676
p-value: 0.38034
T-test for GSM119956 and GSM119957
sample sizes: 6 6
sample means: 8.686193 8.966829
degrees of freedom: 9.792449
p-value: 0.4132705
T-test for GSM119959 and GSM119958
sample sizes: 6 6
sample means: 8.698331 8.676684
degrees of freedom: 9.134459
p-value: 0.492472
Edit 2
If you want to test rows (genes) instead, you can use the following code.
# as.character() so that columns are selected by name even if PatientID is a factor
control_cols <- as.character(groups_data$PatientID[groups_data$TreatmentGroup == "Control"])
treatment_cols <- as.character(groups_data$PatientID[groups_data$TreatmentGroup == "Treatment1"])
nrows <- nrow(expr_data)
for (i in seq_len(nrows)) {
  # row i holds one gene; pick its values for each patient group
  control_group <- expr_data[i, control_cols]
  treatment_group <- expr_data[i, treatment_cols]
  cat("T-test for control vs treatment (", row.names(treatment_group), ")\n")
  result <- tt(as.numeric(control_group), as.numeric(treatment_group), 0, FALSE)
  cat(" sample sizes:", as.numeric(result$sample_sizes), "\n")
  cat(" sample means:", as.numeric(result$sample_means), "\n")
  cat(" degrees of freedom:", as.numeric(result$df), "\n")
  cat(" p-value:", as.numeric(result$pvalue), "\n\n")
}
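With 378811 rows, an explicit loop that prints with `cat` is likely to be slow and awkward to post-process. A sketch of a vectorized alternative follows. It assumes the expression values are available as a numeric matrix with genes in rows and patient IDs as column names. `row_t_test_less` is a hypothetical name, the function uses the pooled-variance statistic with nx + ny - 2 degrees of freedom, and it computes the left-tailed p-value for the alternative mu_control < mu_treat1 stated in the question. The demo data are simulated and are not the poster's data.

```r
# Sketch only: row-wise pooled two-sample t-test, H1: mu_control < mu_treat1.
# `mat` is assumed to be a numeric matrix (genes in rows, patients in columns).
row_t_test_less <- function(mat, control_cols, treatment_cols) {
  a <- mat[, control_cols, drop = FALSE]
  b <- mat[, treatment_cols, drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  ma <- rowMeans(a)
  mb <- rowMeans(b)
  # row-wise sample variances; `a - ma` subtracts each row's mean from that row
  va <- rowSums((a - ma)^2) / (na - 1)
  vb <- rowSums((b - mb)^2) / (nb - 1)
  sp <- sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
  t_stat <- (ma - mb) / (sp * sqrt(1 / na + 1 / nb))
  df <- na + nb - 2
  data.frame(mean_control = ma, mean_treat1 = mb,
             t = t_stat, df = df,
             p_value = pt(t_stat, df))  # left-tail probability
}

# Simulated demo data (made up for illustration only)
set.seed(1)
ids <- c(paste0("C", 1:6), paste0("T", 1:6))
mat <- matrix(rnorm(5 * 12), nrow = 5,
              dimnames = list(paste0("gene", 1:5), ids))
res <- row_t_test_less(mat, ids[1:6], ids[7:12])
res
```

On the data above this could be called as `row_t_test_less(as.matrix(expr_data[, 1:17]), control_cols, treatment_cols)`, assuming `control_cols` and `treatment_cols` are character vectors of patient IDs. The result is a data frame with one row per gene, which can then be sorted or filtered by `p_value`.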