Question

I'm new to R and have a large data frame. I'd like to split its columns into groups by a single letter at the end of the column name, and then append a column at the end containing the average of each row. The data looks as follows:

    V1          V2  V3             V4        V5       V6       V7       V8       V9      V10      
1          gene_id gene_symbol Chr        Biotype     L001P    L003P    L004P    L005P    L008P   
2  ENSG00000000003      TSPAN6   X protein_coding   31.8003  67.3098   63.033    63.83  38.6941  
3  ENSG00000000005        TNMD   X protein_coding 0.0372353  2.28841 0.032932        0 0.358512        
4  ENSG00000000419        DPM1  20 protein_coding   17.5575  43.7474  21.0119  22.9765  26.3166  
5  ENSG00000000457       SCYL3   1 protein_coding   2.68196   3.7079  3.14505  3.82323  3.32028  
6  ENSG00000000460    C1orf112   1 protein_coding  0.532179  2.46598  1.11985 0.584227  1.20095

There are around 70 columns and 13 rows. Only the columns ending in "P" are visible here (V6:V10); about 39 columns into the data frame, the names end in "t" instead. How would I separate the "t" and "P" columns and then take the mean of the rows?

I've tried apply, lapply, grep and split, but still can't seem to separate them. Whenever I try to apply a mean, it returns NA values across the board, and I'm not sure where to go from here.

Answer 1

First of all, you have read the data incorrectly (maybe header = FALSE was selected while importing). It looks like your first row should be the header and your actual data starts from row 2 onwards.

names(df) <- as.character(unlist(df[1, ])) # Use the 1st row as column names (also works if the columns are factors)
df <- df[-1, ]                             # Delete the 1st row
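If re-importing is an option, the header can also be picked up at read time. This is only a minimal sketch, assuming a whitespace-delimited file; the inline `txt` string and its `L001t` values are a made-up stand-in for the real file.

```r
# Made-up sample standing in for the real file (assumption, not the asker's data)
txt <- "gene_id gene_symbol Chr Biotype L001P L003P L001t
ENSG00000000003 TSPAN6 X protein_coding 31.8003 67.3098 12.5
ENSG00000000005 TNMD X protein_coding 0.0372353 2.28841 0.5"

# header = FALSE: the header line becomes row 1 and every column is read as text
bad <- read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)

# header = TRUE: the first line becomes the column names and numbers stay numeric
good <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

names(bad)   # V1, V2, ... as in the question
str(good)    # L001P, L003P and L001t are numeric columns
```

With a real file, the `text = txt` argument would be replaced by the file path.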

Once we have that, let's find the columns whose names end with "t" or "P":

cols <- grep("P$|t$", names(df))

Because the header row was read in as data, these columns were imported as character (or factor, in R versions before 4.0) rather than numeric, so we need to convert cols to numeric:

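df[cols] <- lapply(df[cols], function(x) as.numeric(as.character(x))) # as.character() first, so factors convert by value, not by level code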
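This conversion is also the likely reason the mean came back as NA in the question. A small sketch with made-up vectors (the names `x` and `f` are only for illustration):

```r
# Values as they look when a header row has been read in as data
x <- c("31.8003", "67.3098", "63.033")

# mean() on character input returns NA (with a warning)
na_mean <- suppressWarnings(mean(x))

# After conversion the mean is computed normally
ok_mean <- mean(as.numeric(x))

# If the column is a factor, as.numeric() alone returns the level codes,
# so go through as.character() first
f     <- factor(x)
codes <- as.numeric(f)                # level codes, not the measurements
vals  <- as.numeric(as.character(f))  # the original measurements
```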

Now we can take the mean of each row across these columns using rowMeans:

df$Mean <- rowMeans(df[cols], na.rm = TRUE)

I am not clear whether you want to calculate the mean of the columns ending with "P" and "t" together or separately. The above calculates it together. If you want to calculate them separately, you can do:

p_cols <- grep("P$", names(df))
t_cols <- grep("t$", names(df))
df[c(p_cols, t_cols)] <- lapply(df[c(p_cols, t_cols)], function(x) as.numeric(as.character(x)))
df$P_Mean <- rowMeans(df[p_cols], na.rm = TRUE)
df$T_Mean <- rowMeans(df[t_cols], na.rm = TRUE)
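Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is a sketch rather than a definitive implementation: the function name `row_means_by_suffix`, the `<suffix>_Mean` naming of the new columns, and the toy data frame are all assumptions made for illustration.

```r
# Hypothetical helper: for each suffix, convert the matching columns to numeric
# and append a "<suffix>_Mean" column holding the row means of that group.
row_means_by_suffix <- function(df, suffixes = c("P", "t")) {
  # Find the column groups up front, before any mean columns are added
  col_sets <- lapply(suffixes, function(s) grep(paste0(s, "$"), names(df)))
  for (i in seq_along(suffixes)) {
    cols <- col_sets[[i]]
    if (length(cols) == 0) next  # no column with this suffix: skip it
    df[cols] <- lapply(df[cols], function(x) as.numeric(as.character(x)))
    df[[paste0(suffixes[i], "_Mean")]] <- rowMeans(df[cols], na.rm = TRUE)
  }
  df
}

# Toy data with the header already fixed; values are text, as in the question
toy <- data.frame(gene_id = c("g1", "g2"),
                  L001P = c("1", "3"), L003P = c("2", "5"),
                  L001t = c("10", "20"), L002t = c("30", "40"),
                  stringsAsFactors = FALSE)

res <- row_means_by_suffix(toy)
res$P_Mean  # 1.5 4.0
res$t_Mean  # 20 30
```

Note that the match is case-sensitive, so "t" and "T" are treated as different suffixes.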

Answer 2

Here is a data.table approach.

As you don't provide any reproducible example data, I had to fabricate one:

# load library

library(data.table)

# create data.table as the column binding of some letters and some numbers

dt <- cbind(data.table(x = LETTERS[1:5]), 
            as.data.table(matrix(sample(1:30, 30, FALSE), 
                                 nrow = 5)))

# the names aren't right, so we need to fix them according to your requirement:

names(dt) <- c("x", "1T", "2T", "3T", "1P", "2P", "3P")

Now the working part: we will create a column (that's what := is for) holding the mean (that's the apply and mean functions) over a subset of the columns (that's the .SD), which we need to define (that's the .SDcols part). That definition is dynamic, depending on the last letter of the column name, so we use grep:

dt[, averageTs := apply(.SD, 1, mean), .SDcols = grep("T$", names(dt))]

Here we're looking for a T at the end of the string, and the vector we're searching is the names of the data.table itself.
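To see what grep hands to .SDcols, the pattern can be tried on the name vector alone. A short base R sketch using the same column names as above (the variable names `nm`, `t_idx` and so on are only for illustration):

```r
nm <- c("x", "1T", "2T", "3T", "1P", "2P", "3P")

# "$" anchors the match to the end of each name
t_idx   <- grep("T$", nm)                # 2 3 4 (positions of the matching names)
t_names <- grep("T$", nm, value = TRUE)  # "1T" "2T" "3T"

# Matching is case-sensitive: a lowercase "t" finds nothing here
lower_t <- grep("t$", nm)                # integer(0)
```

The case sensitivity matters for the data in the question, where one group ends in an uppercase "P" and the other in a lowercase "t".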

Doing it for the Ps is the same command, with T replaced by P:

dt[, averagePs := apply(.SD, 1, mean), .SDcols = grep("P$", names(dt))]

How would I create a function to separate and average rows of this data
