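Assigning values in R based on the values in the dataset itself

Time: 2016-07-28 02:16:22

Tags: r

I am very new to R. I have a column of data with about 26,000 entries, of which about 1,200 are unique. Let's assume the column is named "Breed".

What I need is:

  1. I need to get the frequency of each unique value in the column.

    I have already extracted the BreedType and its frequency as shown below. (The breed column is named BreedType.)

  2. Then, using an if condition, I need a new column that contains 'F' if the frequency of a BreedType is less than 50, and the value of 'BreedType' if it is greater than 50.

  3. This is what I tried.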

    x<- sort(table(full$Breed),decreasing=T)
    w=as.data.frame(x)
    
    names(w)[1] = 'BreedType'
    
    w$TrueFalse<-ifelse(w$Freq<50,F,w$BreedType)
    w$TrueFalse
    

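    But the output is not what I expected. Although F is assigned correctly in each row, w$BreedType does not return the value of BreedType; instead I get an integer that increases by one for each row rather than the specific BreedType.

    Can someone explain why the output is not as expected?

    The Breed column looks like this in the dataset, which contains 20,000 rows and 1,200 unique values.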

     Breed
    
     Shetland Sheepdog Mix
     Domestic Shorthair Mix
     Pit Bull Mix
     Domestic Shorthair Mix
     Lhasa Apso/Miniature Poodle
     Cairn Terrier/Chihuahua Shorthair
     Domestic Shorthair Mix
     Domestic Shorthair Mix
     American Pit Bull Terrier Mix
     Cairn Terrier
     Domestic Shorthair Mix
     Miniature Schnauzer Mix
     Pit Bull Mix
     Yorkshire Terrier Mix
     Great Pyrenees Mix
     Domestic Shorthair Mix
     Domestic Shorthair Mix
     Pit Bull Mix
     Angora Mix
     Flat Coat Retriever Mix
     Queensland Heeler Mix
     Domestic Shorthair Mix
     Plott Hound/Boxer
    

    My expected result is:

    BreedType                   Frequency   TrueFalse
    
    Shetland Sheepdog Mix       60          Shetland Sheepdog Mix  
    Domestic Shorthair Mix      20          F
    Pit Bull Mix                80          Pit Bull Mix
    Domestic Shorthair Mix      10          F
    

5 answers:

Answer 0: (score: 2)

Raw data - the full data frame:

> full
#                      Breed
# 1:             Shetland Sheepdog Mix
# 2:            Domestic Shorthair Mix
# 3:                      Pit Bull Mix
# 4:            Domestic Shorthair Mix
# 5:       Lhasa Apso/Miniature Poodle
# 6: Cairn Terrier/Chihuahua Shorthair
# 7:            Domestic Shorthair Mix
# 8:            Domestic Shorthair Mix
# 9:     American Pit Bull Terrier Mix
# 10:                     Cairn Terrier
# 11:            Domestic Shorthair Mix
# 12:           Miniature Schnauzer Mix
# 13:                      Pit Bull Mix
# 14:             Yorkshire Terrier Mix
# 15:                Great Pyrenees Mix
# 16:            Domestic Shorthair Mix
# 17:            Domestic Shorthair Mix
# 18:                      Pit Bull Mix
# 19:                        Angora Mix
# 20:           Flat Coat Retriever Mix
# 21:             Queensland Heeler Mix
# 22:            Domestic Shorthair Mix
# 23:                 Plott Hound/Boxer
# Breed

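Load the data.table library into the workspace

library("data.table")

Convert the full data frame to a data table by reference

setDT(full)

Copy the full data table to the dt1 data table. This is done to keep a backup of the full data table.

dt1 <- copy(full)

Group the dt1 data table by Breed, then access the .N internal variable, which stores the number of entries in each subset, and use it in the ifelse condition. Save the results as the Frequency and TrueFalse columns.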

dt1[, c("Frequency", "TrueFalse") := .(.N, ifelse(.N < 50, FALSE, Breed)), by = Breed]

Display dt1 after the above step

> dt1
#                          Breed          Frequency TrueFalse
# 1:             Shetland Sheepdog Mix         1     FALSE
# 2:            Domestic Shorthair Mix         8     FALSE
# 3:                      Pit Bull Mix         3     FALSE
# 4:            Domestic Shorthair Mix         8     FALSE
# 5:       Lhasa Apso/Miniature Poodle         1     FALSE
# 6: Cairn Terrier/Chihuahua Shorthair         1     FALSE
# 7:            Domestic Shorthair Mix         8     FALSE
# 8:            Domestic Shorthair Mix         8     FALSE
# 9:     American Pit Bull Terrier Mix         1     FALSE
# 10:                     Cairn Terrier         1     FALSE
# 11:            Domestic Shorthair Mix         8     FALSE
# 12:           Miniature Schnauzer Mix         1     FALSE
# 13:                      Pit Bull Mix         3     FALSE
# 14:             Yorkshire Terrier Mix         1     FALSE
# 15:                Great Pyrenees Mix         1     FALSE
# 16:            Domestic Shorthair Mix         8     FALSE
# 17:            Domestic Shorthair Mix         8     FALSE
# 18:                      Pit Bull Mix         3     FALSE
# 19:                        Angora Mix         1     FALSE
# 20:           Flat Coat Retriever Mix         1     FALSE
# 21:             Queensland Heeler Mix         1     FALSE
# 22:            Domestic Shorthair Mix         8     FALSE
# 23:                 Plott Hound/Boxer         1     FALSE
# Breed Frequency TrueFalse

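The data you provided does not contain any breed type with a frequency greater than 50. If it did, then according to the ifelse statement the breed type would be added instead of FALSE.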
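To see the other branch on a small sample, here is a minimal sketch. It makes two assumptions that are not part of the answer above: the threshold is lowered to 3 so that the toy data has a qualifying breed, and the string "F" is used instead of FALSE so that the new column holds a single type (character) in every group. The table name dt and the two-step form are illustrative only.

```r
library(data.table)

# Toy data (hypothetical subset of the Breed column)
dt <- data.table(Breed = c("Pit Bull Mix", "Angora Mix",
                           "Pit Bull Mix", "Pit Bull Mix"))

# Step 1: add the group size to every row, by reference
dt[, Frequency := .N, by = Breed]

# Step 2: keep the breed name for frequent breeds, "F" otherwise.
# The threshold 3 is an assumption for this toy data; the question uses 50.
dt[, TrueFalse := ifelse(Frequency < 3, "F", Breed)]

print(dt)
```

Splitting the count and the ifelse into two steps keeps both branches of ifelse character, which avoids mixing logical and character results across groups.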

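Answer 1: (score: 2)

This assumes that your computation of the frequency for each BreedType already works. It is similar to @Sathish's answer, but uses a data.frame instead of a data.table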

testData <- data.frame(BreedType = c("Shetland Sheepdog Mix", "Domestic Shorthair Mix", "Pit Bull Mix", "Domestic Shorthair Mix"),
                   Frequency = c(60, 20, 80, 10), stringsAsFactors = F)
testData$TrueFalse <- testData$BreedType
testData$TrueFalse[testData$Frequency < 50] <- F 

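The output is the same as the one you have. However, "FALSE" is converted to a string (rather than a boolean), since the column was initialized as a character vector. I'm not sure you can mix booleans and strings.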
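The coercion can be checked directly. An atomic vector in R holds a single type, so a logical value assigned into a character vector is stored as a string. A minimal sketch, using a made-up vector x:

```r
# A character vector, as in the TrueFalse column above
x <- c("Shetland Sheepdog Mix", "Domestic Shorthair Mix")

# Assigning a logical into it coerces the value to character
x[2] <- FALSE

class(x)          # the vector is still character
x[2] == "FALSE"   # the stored value is the string "FALSE"
is.logical(x[2])  # it is no longer a logical value
```

If a marker string such as "F" is wanted, as in the question, assigning "F" instead of F makes that explicit.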

Answer 2: (score: 2)

You can use the count function from the plyr package. I have demonstrated an example using the data you provided.

> library(plyr)

> df <- read.table(text = "Shetland Sheepdog Mix
  Domestic Shorthair Mix
  Pit Bull Mix
  Domestic Shorthair Mix
  Lhasa Apso/Miniature Poodle
  Cairn Terrier/Chihuahua Shorthair
  Domestic Shorthair Mix
  Domestic Shorthair Mix
  American Pit Bull Terrier Mix
  Cairn Terrier
  Domestic Shorthair Mix
  Miniature Schnauzer Mix
  Pit Bull Mix
  Yorkshire Terrier Mix
  Great Pyrenees Mix
  Domestic Shorthair Mix
  Domestic Shorthair Mix
  Pit Bull Mix
  Angora Mix
  Flat Coat Retriever Mix
  Queensland Heeler Mix
  Domestic Shorthair Mix
  Plott Hound/Boxer", sep='\n', stringsAsFactors = F, col.names = c('Breed'))

Use the plyr::count function.

> df <- count(df, 'Breed')

> df 

##                                 Breed freq
## 1       American Pit Bull Terrier Mix    1
## 2                          Angora Mix    1
## 3                       Cairn Terrier    1
## 4   Cairn Terrier/Chihuahua Shorthair    1
## 5              Domestic Shorthair Mix    8
## 6             Flat Coat Retriever Mix    1
## ...
## ...


> df$TrueFalse <- ifelse(df$freq >= 3, df$Breed, F)

> df

##                                      Breed freq                    TrueFalse
## 1            American Pit Bull Terrier Mix    1                        FALSE
## 2                               Angora Mix    1                        FALSE
## 3                            Cairn Terrier    1                        FALSE
## 4        Cairn Terrier/Chihuahua Shorthair    1                        FALSE
## 5                   Domestic Shorthair Mix    8       Domestic Shorthair Mix
## 6                  Flat Coat Retriever Mix    1                        FALSE

Answer 3: (score: 0)

Well, you can also use table from base R to get the frequencies

new_df <- data.frame(table(df$Breed))
#                                Var1 Freq
#1      American Pit Bull Terrier Mix    1
#2                         Angora Mix    1
#3                      Cairn Terrier    1
#4  Cairn Terrier/Chihuahua Shorthair    1
#5             Domestic Shorthair Mix    8
#6            Flat Coat Retriever Mix    1
#7                 Great Pyrenees Mix    1
#8        Lhasa Apso/Miniature Poodle    1
#9            Miniature Schnauzer Mix    1
#10                      Pit Bull Mix    3
#11                 Plott Hound/Boxer    1
#12             Queensland Heeler Mix    1
#13             Shetland Sheepdog Mix    1
#14             Yorkshire Terrier Mix    1

and then use ifelse to get the values of TrueFalse
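The ifelse step itself is not shown here, so the following is only one possible version. It assumes the layout produced by data.frame(table(...)), in which Var1 is a factor, and uses a threshold of 3 purely so that the toy data has a qualifying breed. The conversion with as.character matters: passing the factor straight to ifelse gives back its integer codes, which matches the symptom described in the question.

```r
# Toy data (hypothetical subset of the Breed column)
df <- data.frame(Breed = c("Pit Bull Mix", "Angora Mix", "Pit Bull Mix",
                           "Pit Bull Mix", "Cairn Terrier"),
                 stringsAsFactors = FALSE)

# Frequency table as a data frame; Var1 is a factor here
new_df <- data.frame(table(df$Breed))

# Passing the factor directly: the result contains numbers, not breed names
codes <- ifelse(new_df$Freq < 3, FALSE, new_df$Var1)

# Converting to character first keeps the breed names.
# The threshold 3 is an assumption for this toy data; the question uses 50.
new_df$TrueFalse <- ifelse(new_df$Freq < 3, "F", as.character(new_df$Var1))

print(new_df)
```

The same conversion should apply to the w$BreedType column in the question, since it also comes from as.data.frame on a table.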

Answer 4: (score: 0)

If we need a summarised output, then

library(data.table)
setDT(df)[, .(Frequency = .N, TrueFalse = .N > 55), by = Breed]