Getting rid of the "####" to NA error when setting a column class in an R data frame

Asked: 2014-10-17 16:04:07

Tags: r excel csv dataframe na

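I am working with a csv file that was originally formatted in Excel. I want to convert the Rate column to numeric and remove the "$" sign.

I read the file in with:

> NImp <- read.csv("National_TV_Spots 6_30_14 to 8_31_14.csv", sep=",", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE, na.strings=c("Not Monitored"))

The data frame looks like this: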

HH.IMP..000.       ISCI                                          Creative          Program  Rate
1           NA     IT3896 Rising Costs30 (Opportunity Scholar - No Nursing)      NUVO CINEMA $0.00
2           NA     IT3896 Rising Costs30 (Opportunity Scholar - No Nursing)      NUVO CINEMA $0.00
3          141    IT14429 Rising Costs30 (Opportunity Scholar - No Nursing)            BONUS $0.00
4          476 ITES15443H     Matthew Traina (B. EECT/A. CEET) :60 (no loc) Law & Order: SVU $0.00
5           NA     IT3896 Rising Costs30 (Opportunity Scholar - No Nursing)      NUVO CINEMA $0.00

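When I do the conversion, I get a warning message and all values are coerced to NA:

> NImp$Rate <- as.numeric(gsub("$","", NImp$Rate))
Warning message:
NAs introduced by coercion

I also tried NImp$Rate <- as.numeric(sub("\\$","", NImp$Rate)), but got the same warning message. This time, however, not all values became NA, only specific ones. I opened the csv in Excel to check and realized that Excel had made the column too narrow, so some cells were displayed as "####". Those are the cells that R coerces to NA.

I also tried opening the file in Notepad and reading the Notepad file into R, but I got the same result. The values display correctly in Notepad and when I read the file into R, but when I convert the column to numeric, everything that showed as "####" in Excel becomes NA.

What should I do?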

Added: output of str(NImp)

'data.frame':   9859 obs. of  19 variables:
$ Spot.ID         : int  13072903 13072904 13072898 13072793 13072905 13072899 13072397 13072476 13072398 13072681 ...
$ Date            : chr  "6/30/2014" "6/30/2014" "6/30/2014" "6/30/2014" ...
$ Hour            : int  0 0 0 0 0 0 1 1 1 2 ...
$ Time            : chr  "12:08 AM" "12:20 AM" "12:29 AM" "12:30 AM" ...
$ Local.Date      : chr  "6/30/2014" "6/30/2014" "6/30/2014" "6/30/2014" ...
$ Broadcast.Week  : int  1 1 1 1 1 1 1 1 1 1 ...
$ Local.Hour      : int  0 0 0 0 0 0 1 1 1 2 ...
$ Local.Time      : chr  "12:08 AM" "12:20 AM" "12:29 AM" "12:30 AM" ...
$ Market          : chr  "NATIONAL CABLE" "NATIONAL CABLE" "NATIONAL CABLE" "NATIONAL CABLE" ...
$ Vendor          : chr  "NUVO" "NUVO" "AFAM" "USA" ...
$ Station         : chr  "NUVO" "NUVO" "AFAM" "USA" ...
$ M18.34.IMP..000.: int  NA NA 3 88 NA 3 NA 53 NA 37 ...
$ W18.34.IMP..000.: int  NA NA 86 66 NA 86 NA 70 NA 60 ...
$ A18.34.IMP..000.: int  NA NA 89 154 NA 89 NA 123 NA 97 ...
$ HH.IMP..000.    : int  NA NA 141 476 NA 141 NA 461 NA 434 ...
$ ISCI            : chr  "IT3896" "IT3896" "IT14429" "ITES15443H" ...
$ Creative        : chr  "Rising Costs30 (Opportunity Scholar - No Nursing)" "Rising Costs30 (Opportunity Scholar - No Nursing)" "Rising Costs30 (Opportunity Scholar - No Nursing)" "Matthew Traina (B. EECT/A. CEET) :60 (no loc)" ...
$ Program         : chr  "NUVO CINEMA" "NUVO CINEMA" "BONUS" "Law & Order: SVU" ...
$ Rate            : chr  "$0.00" "$0.00" "$0.00" "$0.00" ...

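1 Answer:

Answer 0 (score: 1)

When a column is formatted as Currency in Excel, values in the thousands or larger contain commas in addition to the dollar-sign prefix. For example, a value might look like $1,200.00. The "####" cells in Excel are only a display effect of a column that is too narrow, which is why they coincide with these longer values. The problem you are running into is that you removed the dollar sign but not the commas, so you get NA when you try to convert to numeric: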

as.numeric(c("0", "0", "1,200"))
[1]  0  0 NA
Warning message:
NAs introduced by coercion 
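
To check whether commas really are the culprit in your own data, a quick diagnostic can list the distinct strings that fail to convert. The following is only a sketch: the vector `rates` is made-up sample data standing in for NImp$Rate, not values from the original file.

```r
# Hypothetical sample standing in for NImp$Rate (assumption, not real data)
rates <- c("$0.00", "$0.00", "$1,200.00", "$350.00", "$2,450.50")

# Strip only the dollar sign, as in the second attempt from the question
no_dollar <- sub("\\$", "", rates)

# suppressWarnings() hides the expected "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(no_dollar))

# Original strings that could not be converted (ignoring values already NA)
bad <- unique(rates[is.na(converted) & !is.na(rates)])
print(bad)
```

If every string printed this way contains a comma, removing the commas as well should resolve the NAs.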

You can remove both the dollar sign and the commas in one step with gsub. I found an example of how to do this in the comments on this answer.

as.numeric(gsub("[$,]", "", c("$0", "$0", "$1,200")))
[1]    0    0 1200

So the code that should work for your dataset is

as.numeric(gsub("[$,]", "", NImp$Rate))
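
As a side note on the first attempt in the question: an unescaped "$" in a regular expression is the end-of-string anchor, so gsub("$", "", ...) removes nothing, which would explain why every value became NA there. The sketch below compares the variants on made-up sample values; `x` and the helper name `parse_currency` are hypothetical and only used for illustration.

```r
# Hypothetical sample values (assumption, not taken from the original file)
x <- c("$0.00", "$1,200.00", "$15.75")

# Unescaped "$" matches the end of the string, so the input comes back unchanged
unchanged <- gsub("$", "", x)

# fixed = TRUE treats the pattern as a literal string, so "$" is removed
literal <- gsub("$", "", x, fixed = TRUE)

# Inside a character class "$" is literal; this removes both "$" and ","
parse_currency <- function(s) as.numeric(gsub("[$,]", "", s))
rate_num <- parse_currency(x)
print(rate_num)
```

To store the result, assign it back to the column, for example NImp$Rate <- as.numeric(gsub("[$,]", "", NImp$Rate)).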