Question

I am getting a strange error when converting gene symbols to Entrez IDs. Here is my code:

testData = read.delim("IL_CellVar.txt",head=T,row.names = 2)
testData[1:5,1:3]

# ClustID Genes.Symbol  ChrLoc
# NM_001034168.1       4         Ank2 chrNA:-1--1
# NM_013795.4          4        Atp5l chrNA:-1--1
# NM_018770            4       Igsf4a chrNA:-1--1
# NM_146150.2          4         Nrd1 chrNA:-1--1
# NM_134065.3          4        Epdr1 chrNA:-1--1

clustNum = 5
filteredClust = testData[testData$ClustID == clustNum,]

any(is.na(filteredClust$Genes.Symbol))
# [1] FALSE

selectedEntrezIds <- unlist(mget(filteredClust$Genes.Symbol,org.Mm.egSYMBOL2EG))

# Error in unlist(mget(filteredClust$Genes.Symbol, org.Mm.egSYMBOL2EG)) :
#  error in evaluating the argument 'x' in selecting a method for function 
#     'unlist': Error in .checkKeysAreWellFormed(keys) :
#  keys must be supplied in a character vector with no NAs

Another approach fails as well:

selectedEntrezIds = select(org.Mm.eg.db,filteredClust$Genes.Symbol, "ENTREZID")

# Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype) :
#   'keys' must be a character vector

Just to rule out NAs as the cause: removing them did not help either:

a <- filteredClust$Genes.Symbol[!is.na(filteredClust$Genes.Symbol)]
selectedEntrezIds <- unlist(mget(a,org.Mm.egSYMBOL2EG))

# Error in unlist(mget(a, org.Mm.egSYMBOL2EG)) : 
#   error in evaluating the argument 'x' in selecting a method for function 
#      'unlist': Error in .checkKeysAreWellFormed(keys) :
#  keys must be supplied in a character vector with no NAs

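I am not sure why I get this error, because the master file from which the gene symbols for testData were extracted converts to Entrez IDs without any problem. Any help would be appreciated.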

Answer 1

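Since you did not provide a minimal reproducible example that would let us replicate the error, I am speculating here based on the error message. It is most likely caused by the default behavior of read.delim and related functions (read.csv, read.table, etc.), which convert strings in the data file to factors.

You need to pass an extra argument to read.delim, namely stringsAsFactors=F (the default was TRUE in R versions before 4.0.0).

That is: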

testData = read.delim("IL_CellVar.txt", head=T, row.names = 2, stringsAsFactors=F)

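From the documentation:

stringsAsFactors
logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.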
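
As a rough sketch of the finer control the documentation mentions, the snippet below reads a small inline table and keeps the symbol column as character via as.is or colClasses. The three rows are made up to resemble the question's file; they are not the real IL_CellVar.txt.

```r
# Hypothetical three-row stand-in for the question's file (tab-separated).
txt <- "ClustID\tGenes.Symbol\n4\tAnk2\n4\tAtp5l\n5\tEpdr1"

# Factor conversion requested explicitly: strings become factors.
as_factor <- read.delim(text = txt, stringsAsFactors = TRUE)

# as.is = TRUE overrides stringsAsFactors for every column.
via_as_is <- read.delim(text = txt, stringsAsFactors = TRUE, as.is = TRUE)

# A named colClasses entry fixes the type of a single column.
via_colClasses <- read.delim(text = txt, stringsAsFactors = TRUE,
                             colClasses = c(Genes.Symbol = "character"))

class(as_factor$Genes.Symbol)
class(via_as_is$Genes.Symbol)
class(via_colClasses$Genes.Symbol)
```

Only the first call should report "factor"; the other two keep the column as character even though stringsAsFactors is TRUE.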

You can check the class of the Genes.Symbol column with:

class(testData$Genes.Symbol)

My guess is that it will be "factor".
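
To illustrate why a factor trips a check for a character vector even when it contains no NAs, here is a minimal base-R sketch. The symbols are made-up sample values, and the snippet only mimics the kind of test the error message describes; it does not call the annotation package's own validation code.

```r
# Made-up symbols stored as a factor, as read.delim produces
# when stringsAsFactors is TRUE.
symbols <- factor(c("Ank2", "Atp5l", "Epdr1"))

class(symbols)          # "factor"
typeof(symbols)         # "integer": a factor stores integer codes plus labels
is.character(symbols)   # FALSE, so a check for a character vector rejects it
anyNA(symbols)          # FALSE: the vector has no NAs
```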

That is what causes the error you ran into:

# Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype) :
#   'keys' must be a character vector

Alternatively, you can convert the factor to a character vector manually:

testData$Genes.Symbol <- as.character(testData$Genes.Symbol)
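
Putting the conversion together with the lookup might look like the sketch below. The two-row filteredClust is a made-up stand-in for the question's data, and the lookup runs only if org.Mm.eg.db and AnnotationDbi are installed. Passing keytype = "SYMBOL" is my assumption about what the lookup needs in order to treat the keys as gene symbols; the original call did not specify it.

```r
# Made-up stand-in for filteredClust from the question.
filteredClust <- data.frame(
  ClustID = c(5L, 5L),
  Genes.Symbol = factor(c("Ank2", "Epdr1"))
)

# Convert the factor to a plain character vector of gene symbols.
keys <- as.character(filteredClust$Genes.Symbol)

# Run the lookup only if the annotation packages are installed.
if (requireNamespace("org.Mm.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  # keytype = "SYMBOL" marks the keys as gene symbols rather than Entrez IDs.
  entrez <- AnnotationDbi::select(org.Mm.eg.db::org.Mm.eg.db,
                                  keys = keys,
                                  columns = "ENTREZID",
                                  keytype = "SYMBOL")
  print(entrez)
}
```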

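You can read more about this particular behavior in this chapter of Hadley's book "Advanced R". I quote the relevant passage here:

... Unfortunately, most data loading functions in R automatically convert character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. A global option, options(stringsAsFactors = FALSE), is available to control this behaviour, but I don't recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you're source()ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave. ...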
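
The manual conversion the passage recommends could look like this small sketch. The labels and the level set are invented for illustration; the point is only that the levels come from what you know about the data rather than from the values that happen to appear in the file.

```r
# Made-up labels read in as plain strings.
cluster <- c("late", "early", "mid", "early")

# Levels and their order are supplied from knowledge of the data,
# including a level ("very_late") that does not occur in this sample.
cluster_f <- factor(cluster, levels = c("early", "mid", "late", "very_late"))

levels(cluster_f)
table(cluster_f)
```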

Error when mapping SYMBOLS to ENTREZID
