I'm new to R and have a large data frame. I'd like to split its columns into groups by a single letter in the column name, and then append a column at the end containing the average of each row. The data looks as follows:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 gene_id gene_symbol Chr Biotype L001P L003P L004P L005P L008P
2 ENSG00000000003 TSPAN6 X protein_coding 31.8003 67.3098 63.033 63.83 38.6941
3 ENSG00000000005 TNMD X protein_coding 0.0372353 2.28841 0.032932 0 0.358512
4 ENSG00000000419 DPM1 20 protein_coding 17.5575 43.7474 21.0119 22.9765 26.3166
5 ENSG00000000457 SCYL3 1 protein_coding 2.68196 3.7079 3.14505 3.82323 3.32028
6 ENSG00000000460 C1orf112 1 protein_coding 0.532179 2.46598 1.11985 0.584227 1.20095
There are around 70 columns and 13 rows. You can only see the columns ending in "P" (V6:V10) here; however, 39 columns into the data frame the names end in "t". I was wondering how I'd separate these two groups, "t" and "P", and then take the mean of the rows.
I've tried apply, lapply, grep and split but still can't seem to separate them. Whenever I have tried to apply a mean, it returns NA values across the board. Not sure where to go from here.
Answer 0 (score: 1)
First of all, you have read the data incorrectly (maybe header = FALSE was selected while importing). It looks like your first row should be your header and your actual data starts from row 2 onwards.
names(df) <- as.character(unlist(df[1, ])) # Use the 1st row as column names (as.character also handles factor columns)
df <- df[-1, ] # Delete the 1st row
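If the file can be re-imported, an alternative to repairing the header by hand is to read it with header = TRUE in the first place. This is only a sketch, assuming a whitespace-delimited file; the inline text is a hypothetical stand-in for the real file, built from two rows of the question's data, and df_ok is an illustrative name:

```r
# Hypothetical stand-in for the real file (whitespace-delimited, header on line 1)
txt <- "gene_id gene_symbol Chr Biotype L001P L003P
ENSG00000000003 TSPAN6 X protein_coding 31.8003 67.3098
ENSG00000000419 DPM1 20 protein_coding 17.5575 43.7474"

# header = TRUE turns the first line into column names,
# so the measurement columns are parsed as numbers directly
df_ok <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

str(df_ok)
```

With a real file, the text = argument would be replaced by the file path.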
Once we have that, let's find the columns whose names end with "t" or "P":
cols <- grep("P$|t$", names(df))
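Because the header row was read in as data, these columns were imported as text rather than numbers, so we need to convert cols to numeric:
df[cols] <- lapply(df[cols], function(x) as.numeric(as.character(x))) # as.character() first, in case the columns are factors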
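This conversion is also the likely reason the earlier attempts returned NA: mean() of a character vector gives NA with a warning. A small sketch using three values from the first data row of the question, stored as text (m_chr and m_num are illustrative names):

```r
x <- c("31.8003", "67.3098", "63.033")  # numbers stored as text

m_chr <- suppressWarnings(mean(x))      # NA: mean() cannot average character data
m_num <- mean(as.numeric(x))            # works once the values are converted

is.na(m_chr)
m_num
```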
Now we can take the mean of each row across these columns using rowMeans:
df$Mean <- rowMeans(df[cols], na.rm = TRUE)
I am not clear whether you want to calculate the mean of the columns ending with "P" and "t" together or separately. The above calculates it together. If you want to calculate them separately, you can do:
p_cols <- grep("P$", names(df))
t_cols <- grep("t$", names(df))
df[c(p_cols, t_cols)] <- lapply(df[c(p_cols, t_cols)], function(x) as.numeric(as.character(x)))
df$P_Mean <- rowMeans(df[p_cols], na.rm = TRUE)
df$T_Mean <- rowMeans(df[t_cols], na.rm = TRUE)
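Putting the steps together, here is a self-contained sketch on a toy data frame. The columns ending in "t" (L001t, L003t) and their values are invented for illustration, since the question only shows the "P" columns:

```r
# Toy data read "incorrectly": the header ended up as row 1, so every column is text
df <- data.frame(
  V1 = c("gene_symbol", "TSPAN6", "DPM1"),
  V2 = c("L001P", "31.8003", "17.5575"),
  V3 = c("L003P", "67.3098", "43.7474"),
  V4 = c("L001t", "10", "20"),   # hypothetical "t" columns
  V5 = c("L003t", "30", "40"),
  stringsAsFactors = FALSE
)

# Promote row 1 to column names, then drop it
names(df) <- as.character(unlist(df[1, ]))
df <- df[-1, ]

# Locate each group by the last letter of the column name
p_cols <- grep("P$", names(df))
t_cols <- grep("t$", names(df))

# Convert the text columns to numbers, then average each group row-wise
df[c(p_cols, t_cols)] <- lapply(df[c(p_cols, t_cols)], as.numeric)
df$P_Mean <- rowMeans(df[p_cols], na.rm = TRUE)
df$T_Mean <- rowMeans(df[t_cols], na.rm = TRUE)

df
```

The column indices are computed before the new columns are appended, so P_Mean and T_Mean never feed into each other's averages.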
Answer 1 (score: 1)
Here is a data.table approach:
As you didn't provide any reproducible example data, I had to fabricate some:
# load library
library(data.table)
# create data.table as the column binding of some letters and some numbers
dt <- cbind(data.table(x = LETTERS[1:5]),
            as.data.table(matrix(sample(1:30, 30, FALSE),
                                 nrow = 5)))
# the names aren't right, so we need to fix them according to your requirement:
names(dt) <- c("x", "1T", "2T", "3T", "1P", "2P", "3P")
Now the working part: we will create a column (that's what := is for) that has the mean applied (that's the apply and mean functions) on some columns (that's the .SD) that we need to define (that's the .SDcols part).
But that definition is dynamic, depending on the last letter of the column name, so we use grep:
dt[, averageTs := apply(.SD, 1, mean), .SDcols = grep("T$", names(dt))]
Here we're looking for a T at the end of the string, and the vector we're searching is the names of the data.table itself.
Doing it for the Ps is just the same command, replacing the Ts with Ps:
dt[, averagePs := apply(.SD, 1, mean), .SDcols = grep("P$", names(dt))]