Prevalence of combination of factors (in R)

Asked: 2016-12-09 12:45:33

Tags: r, combinations

I'm investigating the role of combinations of tumor patterns in predicting malignancy. I have this table of thyroid nodule characteristics, described by six categorical (YES/NO) variables:

ID color shape halo calcium margins solid
1    1     1    1      1       0      0
2    1     1    0      0       1      0
3    0     0    1      1       1      1
4    0     0    1      0       0      0
5    1     1    1      1       0      1

I would like to know the prevalence of each combination of three characteristics being present together. For this example it would be:

          combination freq
color, shape, calcium   2
shape, halo,  calcium   2
color, shape, margins   1
....

So far I have only managed to get the prevalence of each individual characteristic:

as.data.frame(table(tiradsLong$caratteristica, tiradsLong$valore))

which is not my aim.

Thanks in advance, Angelo

2 Answers:

Answer 0 (score: 1)

Here is one solution I could come up with, which I am sure can be improved in elegance.
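For a self-contained example, the question's table can be rebuilt as a data frame first (the snippet below assumes this df):

# Sample data from the question
df <- data.frame(ID      = 1:5,
                 color   = c(1, 1, 0, 0, 1),
                 shape   = c(1, 1, 0, 0, 1),
                 halo    = c(1, 0, 1, 1, 1),
                 calcium = c(1, 0, 1, 0, 1),
                 margins = c(0, 1, 1, 0, 0),
                 solid   = c(0, 0, 1, 0, 1))

The counting itself: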

# All combinations of three feature columns (column 1 is the ID)
x <- combn(2:ncol(df), 3)

# For each triplet of columns, count the rows where all three are 1
as.data.frame(do.call(rbind,
              apply(x, 2, function(y)
                    list(cols  = names(df)[y],
                         value = sum(rowSums(df[, y]) == 3)))))

Output is:

                   cols value
1    color, shape, halo     2
2 color, shape, calcium     2
3 color, shape, margins     1
4   color, shape, solid     1
5  color, halo, calcium     2
...
...

In general, you may want to look at frequent itemsets and apriori (arules package) for such things.
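For example, here is a minimal arules sketch (assuming the df built above; eclat is used here, but apriori with target = "frequent itemsets" works as well):

library(arules)

# Treat each nodule as a "transaction" of its present features;
# column names become the item labels
trans <- as(as.matrix(df[, -1]) == 1, "transactions")

# Mine all itemsets of exactly three features; support is a proportion
# of rows, so multiply by nrow(df) to recover counts
sets <- eclat(trans, parameter = list(support = 1 / nrow(df),
                                      minlen = 3, maxlen = 3))
inspect(sort(sets, by = "support"))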

Answer 1 (score: 0)

The following solution depends on how your data is formatted. It would be very helpful if you provided some sample data via dput or similar.

Anyhow, the following is one of the many possible solutions.

# Simulated sample data; a seed makes the example reproducible
set.seed(123)
df <- data.frame(ID = 1:50,
                 color = rbinom(50, size = 1, prob = 0.5),
                 shape = rbinom(50, size = 1, prob = 0.5),
                 halo = rbinom(50, size = 1, prob = 0.5),
                 calcium = rbinom(50, size = 1, prob = 0.5),
                 margins = rbinom(50, size = 1, prob = 0.5),
                 solid = rbinom(50, size = 1, prob = 0.5))

library(tidyverse)

df %>%
  gather("feature", "value", -ID) %>%   # wide -> long
  filter(value == 1) %>%                # keep only the features that are present
  group_by(ID) %>%
  summarise(fdata = paste(sort(feature), collapse = "_")) %>%  # one encoded string per nodule
  group_by(fdata) %>%
  summarise(count = n())                # prevalence of each exact feature set

Using dplyr, you first transform the data into long format. Then you filter for your signals, i.e. value == 1. Grouping by ID lets you encode each nodule's set of features as a single string; the sort is necessary because the encoded strings need a canonical order. Afterwards we group by the encoded strings and count the number of IDs in each group.
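To make the encoding concrete, here is a hypothetical one-nodule input with only color and halo present; it collapses to the single string "color_halo":

# Worked micro-example of the encoding step
mini <- data.frame(ID = 1L, color = 1, shape = 0, halo = 1,
                   calcium = 0, margins = 0, solid = 0)
mini %>%
  gather("feature", "value", -ID) %>%
  filter(value == 1) %>%
  group_by(ID) %>%
  summarise(fdata = paste(sort(feature), collapse = "_"))
# fdata for ID 1 is "color_halo"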

Edit: Following @Gopala's hint that you want only combinations of three, you could add these lines to the snippet above:

... %>%
    mutate(threeCombos = purrr::map(fdata, function(.x) {
      # Split the encoded string back into its individual features
      splittedStrings <- unlist(strsplit(.x, "_"))
      if (length(splittedStrings) > 2) {
        # Enumerate every triplet contained in this feature set
        res <- data.frame(t(combn(splittedStrings, m = 3)), stringsAsFactors = FALSE) %>%
          unite("threecombs", starts_with("X"), sep = ",")
      } else {
        # Fewer than three features: contributes no triplet
        res <- data.frame()
      }
      return(res)
    })) %>%
    unnest() %>%                     # one row per (feature set, triplet)
    group_by(threecombs) %>%
    summarise(freq = sum(count))     # total count of nodules containing each triplet

This may compute faster than going through all choose(n, m) column combinations. But again, it depends on the further statistical analysis you want to do with the triplets.
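As a quick sanity check, the freq reported for one triplet, e.g. "color,halo,shape", should match a direct count (using the simulated df above):

# Nodules where color, halo and shape are all present
sum(df$color == 1 & df$halo == 1 & df$shape == 1)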