Question

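I created a panel dataset by compiling data from multiple sources. Why are the variables local_aus, hyv_aus, and hyv_aman treated as character rather than numeric? I tried: mutate(local_aus = as.numeric(local_aus), hyv_aus = as.numeric(hyv_aus), hyv_aman = as.numeric(hyv_aman))

However, R shows the warning "NAs introduced by coercion". Why are these numbers treated as characters in the first place?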

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   856 obs. of  24 variables:
 $ district             : num  11704 10408 11921 12007 11313 ...
 $ year                 : num  1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
 $ local_aus            : chr  "178145" "94390" "119375" "56375" ...
 $ hyv_aus              : chr  "3010" "850" "2095" "3785" ...
 $ broadcast_aman       : num  70325 9435 33340 1495 316580 ...
 $ local_transplant_aman: num  673060 270550 282655 35825 188655 ...
 $ hyv_aman             : chr  "3185" "920" "3080" "820" ...
 $ local_boro           : num  6450 12050 41430 14450 45970 ...
 $ hyv_boro             : num  67930 10630 121340 15640 116500 ...
 $ danger_days_aus      : num  0 0 142 4 108 434 5 36 33 1 ...
 $ benefit_days_aus     : num  0 0 9 0 21 110 0 0 0 0 ...
 $ danger_days_aman     : num  0 0 32 0 43 218 0 0 29 2 ...
 $ benefit_days_aman    : num  0 0 89 0 110 426 3 52 53 2 ...
 $ danger_days_boro     : num  0 0 1 0 0 0 0 0 0 0 ...
 $ benefit_days_boro    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ abovemax_aus         : num  2 25 1 37 4 18 29 19 45 42 ...
 $ belowmin_aus         : num  1 1 2 4 2 0 3 3 2 0 ...
 $ abovemax_aman        : num  0 0 0 0 1 0 2 1 1 6 ...
 $ belowmin_aman        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ abovemax_boro        : num  2 7 0 10 1 8 4 7 5 12 ...
 $ belowmin_boro        : num  116 123 107 92 76 115 138 125 124 89 ...
 $ rain_aus             : num  5969 1088 6902 5637 3831 ...
 $ rain_aman            : num  5477 650 5806 2291 2900 ...
 $ rain_boro            : num  601.6 38.1 1067.3 381 387.4 ...

Answer 1

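As others have mentioned, some of your values are most likely irregular NA markers. You may also have imported the CSV with the wrong decimal format. Look at rows 3-5 of the local_aus column in the example below. Any one of the values -, 563,75, and none is enough to make R read the whole column as class "character":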

# A tibble: 5 x 2
   year local_aus
  <int> <chr>    
1  1970 178145   
2  1970 94390    
3  1970 -        
4  1970 563,75   
5  1970 none

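If you run as.numeric(df$local_aus), you get the same warning as above, because every value that cannot be parsed as a number becomes NA. You can use a regular expression to find the problematic values (assuming the values are supposed to be integers):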

> df$local_aus[!grepl("^\\d+$", df$local_aus)]
[1] "-"      "563,75" "none"
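The same check can be run over every character column at once. The following is only a sketch: the data frame is made-up sample data modeled on the example above (the hyv_aus values and the "n/a" marker are assumptions), and find_bad_values is a hypothetical helper name, not part of any package.

```r
# Made-up sample data modeled on the example above (values are assumptions)
df <- data.frame(
  year      = c(1970, 1970, 1970, 1970, 1970),
  local_aus = c("178145", "94390", "-", "563,75", "none"),
  hyv_aus   = c("3010", "850", "2095", "3785", "n/a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the distinct non-missing values
# that are not made up of digits only
find_bad_values <- function(x) {
  unique(x[!is.na(x) & !grepl("^[0-9]+$", x)])
}

# Apply the helper to every character column
chr_cols <- names(df)[vapply(df, is.character, logical(1))]
bad <- lapply(df[chr_cols], find_bad_values)
print(bad)
```

The resulting list has one entry per character column, so it shows which NA markers and decimal formats need handling before conversion.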

It is best to deal with these issues when calling read.* or readr::read_*. Here are two examples that import the example data above correctly:

# using base R
df <- read.table("example.txt",
                 header = TRUE,
                 stringsAsFactors = FALSE,
                 dec = ",",                    # comma is the decimal mark
                 na.strings = c("-", "none")   # treat these strings as NA
                 )

# using readr library
df <- readr::read_table("example.txt",
                        locale = readr::locale(decimal_mark = ","),
                        na = c("-", "none")
                        )

#### OUTPUT ####

df

# A tibble: 5 x 2
   year local_aus
  <dbl>     <dbl>
1  1970   178145 
2  1970    94390 
3  1970       NA 
4  1970      564.
5  1970       NA
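If the data cannot be re-imported, the columns can also be repaired after the fact. This is a minimal sketch, assuming that "-" and "none" are the only NA markers and that a comma is the decimal mark; clean_numeric is a hypothetical helper name, not a function from base R or readr.

```r
# Hypothetical helper: convert a character vector to numeric after
# mapping known NA markers to NA and normalizing the decimal mark
clean_numeric <- function(x, na_strings = c("-", "none"), dec = ",") {
  x <- trimws(x)                        # drop stray whitespace
  x[x %in% na_strings] <- NA_character_ # explicit NA, so no coercion warning
  x <- sub(dec, ".", x, fixed = TRUE)   # "563,75" becomes "563.75"
  as.numeric(x)
}

local_aus <- c("178145", "94390", "-", "563,75", "none")
cleaned <- clean_numeric(local_aus)
print(cleaned)
```

With dplyr 1.0 or later, the helper could be applied to the three affected columns in one step, for example mutate(df, across(c(local_aus, hyv_aus, hyv_aman), clean_numeric)). Any remaining "NAs introduced by coercion" warning would then point to a marker that is not yet listed in na_strings.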
