I have a data frame with about 3,000 observations. I need to analyze not only the whole data set but also subsamples, which I create as follows:
SNIPPET 1:
allophone.count.test <- subset (merged.data.for.study, Environment %in% curr.phon.env)
My data contain a category ("Allophone" in the data below) whose values need to appear in a specific order in a bar chart, as follows:
[p], [p̚], [pʰ], [p͡ɸ], [ɸ], [b], [b͡β], [β], OTHER, ∅
To establish the correct order, I assigned numbers to the above values during part of the data processing. The values then look like this:
01. [p], 02. [p̚], 03. [pʰ], 04. [p͡ɸ], 05. [ɸ], 06. [b], 07. [b͡β], 08. [β], 09. OTHER, 10. ∅
Here is a sample data frame. Note that at this stage Allophone and Allophone.Backup contain identical values, so that errors can be checked later:
# Fields are comma-separated so that values containing spaces (e.g. "01. [p]") parse correctly
allophone.count.test <- read.table(
  header=TRUE, sep=",", text='Region,Phoneme,Allophone,Count,Total.Count,Percentage,Allophone.Backup
LocationA,p,01. [p],16,92,17.4,01. [p]
LocationA,p,02. [p̚],4,92,4.3,02. [p̚]
LocationA,p,05. [ɸ],8,92,8.7,05. [ɸ]
LocationA,p,06. [b],5,92,5.4,06. [b]
LocationA,p,08. [β],55,92,59.8,08. [β]
LocationA,p,09. OTHER,1,92,1.1,09. OTHER
LocationA,p,10. ∅,3,92,3.3,10. ∅
LocationB,p,01. [p],19,136,14,01. [p]
LocationB,p,03. [pʰ],1,136,0.7,03. [pʰ]
LocationB,p,05. [ɸ],14,136,10.3,05. [ɸ]
LocationB,p,06. [b],7,136,5.1,06. [b]
LocationB,p,08. [β],88,136,64.7,08. [β]
LocationB,p,10. ∅,7,136,5.1,10. ∅'
)
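This is all well and good: every plotting tool I have tried (barplot, ggplot2, and the wrapper I am currently using) orders these values alphanumerically, so when I plot the data with the added numbers, everything comes out in the right order. Unfortunately, the numbers make the plots look ridiculous, and they would not be accepted for publication. So I need to keep the correct order while getting rid of the numbers.
The problem is that as soon as I try to plot the values without the numbers, every plotting tool I have tried sorts them alphabetically.
Most of the solutions I have found for this problem say to convert the column to a factor. Here is the code I used to (1) convert it to a factor and (2) strip the leading digits + period + space: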
SNIPPET 2:
allophone.count.test$Allophone <- factor (allophone.count.test$Allophone)
allophone.count.test$Allophone <- gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
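This appears to work, as shown below: the Allophone values no longer have the leading digits, period, or space, and the rows are still in the correct order: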
      Region Phoneme Allophone Count Total.Count Percentage Allophone.Backup
1  LocationA       p       [p]    16          92       17.4          01. [p]
2  LocationA       p       [p̚]     4          92        4.3          02. [p̚]
3  LocationA       p       [ɸ]     8          92        8.7          05. [ɸ]
4  LocationA       p       [b]     5          92        5.4          06. [b]
5  LocationA       p       [β]    55          92       59.8          08. [β]
6  LocationA       p     OTHER     1          92        1.1        09. OTHER
7  LocationA       p         ∅     3          92        3.3            10. ∅
8  LocationB       p       [p]    19         136       14.0          01. [p]
9  LocationB       p      [pʰ]     1         136        0.7         03. [pʰ]
10 LocationB       p       [ɸ]    14         136       10.3          05. [ɸ]
11 LocationB       p       [b]     7         136        5.1          06. [b]
12 LocationB       p       [β]    88         136       64.7          08. [β]
13 LocationB       p         ∅     7         136        5.1            10. ∅
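Then I go to plot, and everything is in alphabetical order (in my actual work I am not using ggplot2 directly but the wrapper I linked to; for purposes of illustration, though, ggplot2 does the same thing):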
SNIPPET 3:
ggplot(allophone.count.test, aes(factor(Allophone), Count, fill = Region)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_brewer(palette = "Set1")
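Now, I have found a partial solution that works ONLY when all possible values of Allophone are present (i.e. they have a Count > 1 in the particular subsample I am working with at a given time). It consists of manually assigning the number-free versions of the Allophone values as labels: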
SNIPPET 4:
allophone.count.test$Allophone <- factor (allophone.count.test$Allophone, labels = c("[p]", "[p̚]", "[pʰ]", "[p͡ɸ]", "[ɸ]", "[b]", "[b͡β]", "[β]", "OTHER", "∅"))
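However, this is a very fragile solution: Allophone has 10 possible values, and they are not always all present in a given subsample (such as the one I have provided here). When that happens, R chokes.
Is there a more robust way to do what I am trying to do with the labels? (Or any other way at all, for that matter?)
The best attempt I have been able to come up with (I am neither a programmer nor a statistician) fails: it assigns the wrong labels to many of the values (compare Allophone and Allophone.Backup starting at the third row):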
SNIPPET 5:
allophone.count.test$Allophone <- factor (
allophone.count.test$Allophone, labels = unique (
gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
)
)
      Region Phoneme Allophone Count Total.Count Percentage Allophone.Backup
1  LocationA       p       [p]    16          92       17.4          01. [p]
2  LocationA       p       [p̚]     4          92        4.3          02. [p̚]
3  LocationA       p       [b]     8          92        8.7          05. [ɸ]
4  LocationA       p       [β]     5          92        5.4          06. [b]
5  LocationA       p     OTHER    55          92       59.8          08. [β]
6  LocationA       p         ∅     1          92        1.1        09. OTHER
7  LocationA       p      [pʰ]     3          92        3.3            10. ∅
8  LocationB       p       [p]    19         136       14.0          01. [p]
9  LocationB       p       [ɸ]     1         136        0.7         03. [pʰ]
10 LocationB       p       [b]    14         136       10.3          05. [ɸ]
11 LocationB       p       [β]     7         136        5.1          06. [b]
12 LocationB       p     OTHER    88         136       64.7          08. [β]
13 LocationB       p      [pʰ]     7         136        5.1            10. ∅
The following is almost identical. It tries to assign the number-free forms to Allophone as labels, but it fails:
SNIPPET 6:
allophone.count.test$Allophone <- factor (
allophone.count.test$Allophone, labels = gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
)
Error in factor(allophone.count.test$Allophone, labels = gsub("[0-9][0-9]\\. ", :
invalid 'labels'; length 13 should be 1 or 8
When I instead try to create levels that hold the bare Allophone values, I get a different message:
SNIPPET 7:
allophone.count.test$Allophone <- factor (
allophone.count.test$Allophone, levels = gsub ("[0-9][0-9]\\. ", "", allophone.count.test$Allophone, perl=TRUE)
)
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
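I would greatly appreciate any help anyone can give me. To reiterate, the desired result is a bar chart in which the order of the numbered values of the Allophone vector is preserved once the numbers are removed.
(Edit: I have added "Snippet" headings for anyone who wants to refer to a specific piece of code, since this question is so long.)
Answer 0 (score: 1)
Here is a simplified example showing the logic of how this should work: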
# specify the order of the variable you want:
levs <- c("[p]", "[β]", "OTHER", "∅")
# here's some example data I prepared earlier:
test <- data.frame(
Region = rep(c("LocationA","LocationB"), c(4,4)),
Allophone = levs[c(1,3,2,4,3,2,1,4)],
Count = c(16, 4, 8, 5, 55, 1, 3, 19),
stringsAsFactors=FALSE
)
#     Region Allophone Count
#1 LocationA       [p]    16
#2 LocationA     OTHER     4
#3 LocationA       [β]     8
#4 LocationA         ∅     5
#5 LocationB     OTHER    55
#6 LocationB       [β]     1
#7 LocationB       [p]     3
#8 LocationB         ∅    19
# convert the Allophone variable with the specified order:
test$Allophone <- factor(test$Allophone, levels=levs)
# do the plotting (requires the ggplot2 package):
library(ggplot2)
ggplot(test, aes(Allophone, Count, fill = Region)) +
  geom_bar(stat="identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1")
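
Applied to the numbered values from the question, the same idea might look like the sketch below. It assumes every value carries a two-digit `NN. ` prefix, as in the sample data. The small `dat` data frame is a made-up subsample used only for illustration, and the names `present`, `numbered` and `bare` are likewise hypothetical. The levels are ordered by the numeric prefix, and the labels are derived from those same ordered levels, so labels and levels cannot drift out of step the way they do in Snippet 5.

```r
# Hypothetical subsample in which most of the ten allophones are absent.
dat <- data.frame(
  Region    = c("LocationA", "LocationA", "LocationB", "LocationB"),
  Allophone = c("10. ∅", "01. [p]", "08. [β]", "01. [p]"),
  Count     = c(3, 16, 88, 19),
  stringsAsFactors = FALSE
)

# Numbered values that actually occur, ordered by their two-digit prefix.
present  <- unique(dat$Allophone)
numbered <- present[order(as.integer(substr(present, 1, 2)))]

# Strip the "NN. " prefix from the ordered values to get matching labels.
bare <- sub("^[0-9]{2}\\. ", "", numbered)

# levels and labels have the same length and order, so each numbered
# value is mapped to its own number-free label.
dat$Allophone <- factor(dat$Allophone, levels = numbered, labels = bare)

levels(dat$Allophone)  # "[p]" "[β]" "∅"
```

Because the levels are taken from whatever values are present, this should not fail when a subsample lacks some allophones. If the absent categories should still appear on the axis, one option is to pass the full ten-value vector as `levels`, as in the example above, and add `scale_x_discrete(drop = FALSE)` to the ggplot2 call.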