Trying to keep category values in order in a bar plot

Time: 2015-12-01 03:36:53

Tags: r statistics

I have a data frame with about 3,000 observations. I need to analyze not only the whole data set but also subsamples, which I create as follows:

SNIPPET 1:

allophone.count.test <- subset (merged.data.for.study, Environment %in% curr.phon.env)

In my data, the values of one category ("Allophone" in the data below) need to appear in a specific order in the bar plot, as follows:

[p], [p̚], [pʰ], [p͡ɸ], [ɸ], [b], [b͡β], [β], OTHER, ∅

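To establish the correct order, I assigned numbers to the values above during part of the data processing. The numbered values look like this: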

01. [p], 02. [p̚], 03. [pʰ], 04. [p͡ɸ], 05. [ɸ], 06. [b], 07. [b͡β], 08. [β], 09. OTHER, 10. ∅

Here is a sample data frame. Note that at this stage Allophone and Allophone.Backup contain the same values, so that errors can be checked later:

allophone.count.test <- read.table(
    header=TRUE, sep="\t", text='Region Phoneme Allophone   Count   Total.Count Percentage  Allophone.Backup
LocationA   p   01. [p] 16  92  17.4    01. [p]
LocationA   p   02. [p̚]    4   92  4.3 02. [p̚]
LocationA   p   05. [ɸ] 8   92  8.7 05. [ɸ]
LocationA   p   06. [b] 5   92  5.4 06. [b]
LocationA   p   08. [β] 55  92  59.8    08. [β]
LocationA   p   09. OTHER   1   92  1.1 09. OTHER
LocationA   p   10. ∅   3   92  3.3 10. ∅
LocationB   p   01. [p] 19  136 14  01. [p]
LocationB   p   03. [pʰ]    1   136 0.7 03. [pʰ]
LocationB   p   05. [ɸ] 14  136 10.3    05. [ɸ]
LocationB   p   06. [b] 7   136 5.1 06. [b]
LocationB   p   08. [β] 88  136 64.7    08. [β]
LocationB   p   10. ∅   7   136 5.1 10. ∅'
)

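This all works fine: every plotting tool I have tried (barplot, ggplot2, and the wrapper I am currently using) sorts these values alphanumerically, so when I plot the data with the added numbers, everything is in the right order. Unfortunately, the numbers make the plots look ridiculous, and they would not be accepted for publication. So I need to get rid of the numbers while keeping the correct order.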

The problem is that as soon as I try to plot the values without the numbers, every plotting tool I have tried sorts them alphabetically.

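Most of the solutions I have found for this problem say to convert the column to a factor. Here is the code I use to (1) convert it to a factor and (2) strip the leading digits + period + space: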

SNIPPET 2:

allophone.count.test$Allophone <- factor (allophone.count.test$Allophone)
allophone.count.test$Allophone <- gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)

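This looks like it worked, as shown below: the Allophone values no longer have the leading digits, period, or space, and the correct order appears to be preserved: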

    Region  Phoneme Allophone   Count   Total.Count Percentage  Allophone.Backup
1   LocationA   p   [p] 16  92  17.4    01. [p]
2   LocationA   p   [p̚]    4   92  4.3 02. [p̚]
3   LocationA   p   [ɸ] 8   92  8.7 05. [ɸ]
4   LocationA   p   [b] 5   92  5.4 06. [b]
5   LocationA   p   [β] 55  92  59.8    08. [β]
6   LocationA   p   OTHER   1   92  1.1 09. OTHER
7   LocationA   p   ∅   3   92  3.3 10. ∅
8   LocationB   p   [p] 19  136 14.0    01. [p]
9   LocationB   p   [pʰ]    1   136 0.7 03. [pʰ]
10  LocationB   p   [ɸ] 14  136 10.3    05. [ɸ]
11  LocationB   p   [b] 7   136 5.1 06. [b]
12  LocationB   p   [β] 88  136 64.7    08. [β]
13  LocationB   p   ∅   7   136 5.1 10. ∅
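
A note on what Snippet 2 actually produces: `gsub()` coerces its input with `as.character()` and returns a character vector, so the factor created on the first line is replaced by plain strings on the second. The order seen in the printout is only the row order of the data frame. A minimal sketch, using a hypothetical three-value vector `x` rather than the full data:

```r
# Hypothetical numbered values standing in for the Allophone column
x <- c("09. OTHER", "01. [p]", "06. [b]")

f <- factor(x)                              # levels are sorted: 01., 06., 09.
stripped <- gsub("[0-9][0-9]\\. ", "", f)   # gsub() returns a character vector

is.factor(f)            # TRUE
is.character(stripped)  # TRUE: the factor, and with it the level order, is gone
```

Because `stripped` carries no level information, a plotting function that converts it to a factor again will sort the bare values alphabetically.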

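Then I go to plot, and everything is sorted alphabetically (in my work I am not using ggplot2 directly but the wrapper I linked to; for illustration purposes, though, ggplot2 does the same thing):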

SNIPPET 3:

ggplot(allophone.count.test, aes(factor(Allophone), Count, fill = Region)) + 
    geom_bar(stat="identity", position = "dodge") + 
    scale_fill_brewer(palette = "Set1")

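Now, I have found a partial solution that works ONLY when all possible values of Allophone are present (i.e., they have Count > 1 in the particular subsample I'm working with at a given time). Namely, manually assigning the number-free versions of the Allophone values as labels: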

SNIPPET 4:

allophone.count.test$Allophone <- factor (allophone.count.test$Allophone, labels = c("[p]", "[p̚]", "[pʰ]", "[p͡ɸ]", "[ɸ]", "[b]", "[b͡β]", "[β]", "OTHER", "∅"))

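However, this is a very non-robust solution: Allophone has 10 possible values, and they are not always all present in a given subsample (such as the one I have provided here). When that happens, R chokes.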

Is there a more robust way to do what I am trying to do with the labels? (Or any other way, for that matter?)

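The best attempt I have been able to come up with (I am neither a programmer nor a statistician) fails: it assigns the wrong labels to many of the values (compare Allophone and Allophone.Backup starting at the third row):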

SNIPPET 5:

allophone.count.test$Allophone <- factor (
    allophone.count.test$Allophone, labels = unique (
        gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
    )
)

    Region  Phoneme Allophone   Count   Total.Count Percentage  Allophone.Backup
1   LocationA   p   [p] 16  92  17.4    01. [p]
2   LocationA   p   [p̚]    4   92  4.3 02. [p̚]
3   LocationA   p   [b] 8   92  8.7 05. [ɸ]
4   LocationA   p   [β] 5   92  5.4 06. [b]
5   LocationA   p   OTHER   55  92  59.8    08. [β]
6   LocationA   p   ∅   1   92  1.1 09. OTHER
7   LocationA   p   [pʰ]    3   92  3.3 10. ∅
8   LocationB   p   [p] 19  136 14.0    01. [p]
9   LocationB   p   [ɸ] 1   136 0.7 03. [pʰ]
10  LocationB   p   [b] 14  136 10.3    05. [ɸ]
11  LocationB   p   [β] 7   136 5.1 06. [b]
12  LocationB   p   OTHER   88  136 64.7    08. [β]
13  LocationB   p   [pʰ]    7   136 5.1 10. ∅
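
The mismatch can be reproduced on a small scale. `factor()` builds its levels from the sorted distinct values, while `unique()` returns values in order of first appearance, and `labels` are matched to the levels by position. The two orders agree only if the data happen to be sorted already. A minimal sketch with a hypothetical vector `x` in which "09." appears before "06." (the names `pat`, `bad`, and `good` are illustrative):

```r
# Hypothetical values: "09." appears before "06." in the data
x <- c("01. [p]", "09. OTHER", "06. [b]")
pat <- "^[0-9]{2}\\. "

levels(factor(x))        # sorted:           "01. [p]" "06. [b]" "09. OTHER"
unique(sub(pat, "", x))  # appearance order: "[p]" "OTHER" "[b]"

# Labels are matched to the sorted levels by position, so they shift:
bad <- factor(x, labels = unique(sub(pat, "", x)))
as.character(bad)        # "[p]" "[b]" "OTHER"

# Taking the labels from the sorted levels keeps them aligned:
good <- factor(x, labels = sub(pat, "", levels(factor(x))))
as.character(good)       # "[p]" "OTHER" "[b]"
levels(good)             # "[p]" "[b]" "OTHER"
```

The `good` variant only needs the values that are actually present, so it should not depend on all ten allophones occurring in a given subsample.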

The following is nearly the same. It tries to assign the number-free forms to Allophone as labels, but it fails:

SNIPPET 6:

allophone.count.test$Allophone <- factor (
    allophone.count.test$Allophone, labels = gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
)

Error in factor(allophone.count.test$Allophone, labels = gsub("[0-9][0-9]\\. ",  : 
  invalid 'labels'; length 13 should be 1 or 8

When I try to create levels to hold the bare Allophone values, I get a different error:

SNIPPET 7:

allophone.count.test$Allophone <- factor (
    allophone.count.test$Allophone, levels = gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
)

Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

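I would greatly appreciate any help anyone can give me. To reiterate, the desired result is a bar plot in which the order of the numbered Allophone values is preserved once the numbers are removed.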

(Edit: I have added "Snippet" headings for anyone who wants to refer to a specific piece of code, since this question is long.)

1 answer:

Answer 0: (score: 1)

Here is a simplified example showing the logic of how this should work:

# specify the order of the variable you want:
levs <- c("[p]", "[β]", "OTHER", "∅")

# here's some example data I prepared earlier:
test <- data.frame(
  Region = rep(c("LocationA","LocationB"), c(4,4)),
  Allophone = levs[c(1,3,2,4,3,2,1,4)],
  Count = c(16, 4, 8, 5, 55, 1, 3, 19),
  stringsAsFactors=FALSE
)

#     Region Allophone Count
#1 LocationA       [p]    16
#2 LocationA     OTHER     4
#3 LocationA       [β]     8
#4 LocationA         ∅     5
#5 LocationB     OTHER    55
#6 LocationB       [β]     1
#7 LocationB       [p]     3
#8 LocationB         ∅    19

# convert the Allophone variable with the specified order:
test$Allophone <- factor(test$Allophone, levels=levs)

# do the plotting (requires the ggplot2 package):
library(ggplot2)
ggplot(test, aes(Allophone, Count, fill = Region)) + 
    geom_bar(stat="identity", position = "dodge") + 
    scale_fill_brewer(palette = "Set1")
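
Applied to the question's subsamples, the same idea might look like the sketch below. The full set of ten allophones is listed once, in the desired order, and passed as `levels`. Values missing from a subsample then simply become unused levels instead of causing an error. The data frame `subsample` and the names `all.levs` and `bare` are hypothetical; the values are taken from the question.

```r
# Full ordering from the question, defined once (hypothetical name)
all.levs <- c("[p]", "[p̚]", "[pʰ]", "[p͡ɸ]", "[ɸ]",
              "[b]", "[b͡β]", "[β]", "OTHER", "∅")

# Hypothetical subsample: numbered values as in the question, some absent
subsample <- data.frame(
  Allophone = c("08. [β]", "01. [p]", "10. ∅", "06. [b]"),
  Count     = c(55, 16, 3, 5),
  stringsAsFactors = FALSE
)

# Strip the "NN. " prefix first, then impose the order through levels
bare <- sub("^[0-9]{2}\\. ", "", subsample$Allophone)
subsample$Allophone <- factor(bare, levels = all.levs)

# The levels that are present keep their relative order
levels(droplevels(subsample$Allophone))  # "[p]" "[b]" "[β]" "∅"
```

By default ggplot2 leaves unused levels off a discrete axis, so the plot should show only the allophones that occur; adding `scale_x_discrete(drop = FALSE)` keeps empty slots for the missing ones if that is preferred.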
