Question

I am reading in a dataset that looks like this:

My code is as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                  header= TRUE, 
                  sep = "\t",
                  quote = "\"",
                  dec = ".",
                  fill = TRUE,
                  as.is = c("ParkName", "State"))

Then I get the following warnings:

Warning messages: 1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat' 2: In read.table(file = file, header = header, sep = sep, quote = quote, : not all columns named in 'as.is' exist

So I changed "header = TRUE" to "header = FALSE", as follows:

NatPark <- read.delim (paste0(dirdata,"NatPark_Plus.dat"),
                  header= FALSE, 
                  sep = "\t",
                  quote = "\"",
                  dec = ".",
                  fill = TRUE,
                  as.is = c("ParkName", "State"))

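I get the same warning messages:

Warning messages: 1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on '/Volumes/Elements/STAT_611/611/DATA/DATA11/NatPark_Plus.dat' 2: In read.table(file = file, header = header, sep = sep, quote = quote, : not all columns named in 'as.is' exist

This time all the row numbers show up, as shown below. However, I don't understand the output of str(NatPark). What is that "V1"? And what is the "4 1 5 2 3" that follows it? Thanks for any advice!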

Answer 1

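I haven't worked with .dat files much, but if you can share a download link I can help troubleshoot further. For now, I can offer these insights:

V1 (along with V2, V3, V4, ...) is the column name R assigns automatically when there is no header. Since there is only a V1, R thinks you have just one column with the current settings.
The "4 1 5 2 3" you see in the str output are the integer codes of the levels, because the variable is a factor (in this case each whole line is treated as a single value of one variable). By default, R sorts factor levels alphabetically. An example from the iris dataset should help clarify: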

str(iris)
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris$Species)
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"

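Created on 2018-08-18 by the reprex package (v0.2.0).

You can see that the value setosa is coded as 1 because it is the first level, versicolor is 2, and virginica is 3. However, this should all be a moot point, since you don't want to read each whole line in as a single variable.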
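To tie this back to the question, here is a minimal sketch of how "V1" and "4 1 5 2 3" can arise. The five lines are a hypothetical stand-in for the file (the real file was not shared), typed with spaces instead of tabs, so sep = "\t" finds only one column. stringsAsFactors = TRUE is set explicitly because it was the default before R 4.0.0 and is not the default any more.

```r
# Hypothetical stand-in for the file contents: no tab characters,
# so every line ends up as one value in one column.
raw_lines <- c(
  "Yellowstone  ID/MT/WY  1872  4,065,493",
  "Everglades  FL  1934  1,398,800",
  "Yosemite  CA  1864  760,917",
  "Great Smoky Mountains  NC/TN  1926  520,269",
  "Wolf Trap Farm  VA  1966  130"
)

one_col <- read.delim(text = raw_lines, header = FALSE, sep = "\t",
                      stringsAsFactors = TRUE)

names(one_col)          # the auto-assigned name "V1"
levels(one_col$V1)      # levels sorted alphabetically
as.integer(one_col$V1)  # 4 1 5 2 3: the position of each row in the sorted levels
```

"Yellowstone..." is the fourth level in alphabetical order, "Everglades..." the first, and so on, which is where the codes come from.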

Answer 2

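As for your main question, I was able to put together a custom function that parses your data specifically. In the future, if you have the option of quoting the text fields in the source data, things will probably be much simpler. Either way, I hope this works for you! All that is left is to set the column names and convert some of the columns from character to numeric.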

library(tidyverse)
library(stringr)

directory <- "/Users/jas/Desktop"
filename <- "NatPark_Plus.dat"
file <- file.path(directory, filename)

# tabs
data <- read.delim(file, header = FALSE, sep = "\t")
#> Warning in read.table(file = file, header = header, sep = sep, quote =
#> quote, : incomplete final line found by readTableHeader on '/Users/jas/
#> Desktop/NatPark_Plus.dat'

# We have 5 records, but the spacing amongst them is uneven and some words with spaces

text <- data$V1

# Parse text to make same number of columns - 4
# Creates a separate dataframe for each row
parse_text_to_df <- function(x) {
  # Find more than one spaces and replace with tab
  x <- gsub("[ ]{2,}", "\t", x)
  # replace remaining space with tab (cannot use comma since numbers have comma)
  x <- gsub(" ", "\t", x)
  # Should be only 3 tabs on each line - WORKS FOR THIS DATASET ONLY
  total_tabs <- stringr::str_count(x, "\t")
  # If we have those words with spaces, we need to remove the extra tabs between them
  if (total_tabs[1] > 3) {
    num_tabs_to_remove <- total_tabs - 3
    # seq_len() loops once per extra tab; range() would return c(min, max)
    for (i in seq_len(num_tabs_to_remove)) {
      x <- sub("\t", " ", x)
    }
  }
  # Convert to an object that can be read back into a dataframe
  x <- readLines(textConnection(x))
  df <- read.delim(text = x, header = FALSE, sep = "\t") %>%
    mutate_all(as.character)
  return(df)
}

# Combine each of the 1 row dataframes into one dataframe (all character vectors)
df <- text %>% map_df(parse_text_to_df)
df
#>                      V1       V2   V3        V4
#> 1           Yellowstone ID/MT/WY 1872 4,065,493
#> 2            Everglades       FL 1934 1,398,800
#> 3              Yosemite       CA 1864   760,917
#> 4 Great Smoky Mountains    NC/TN 1926   520,269
#> 5        Wolf Trap Farm       VA 1966       130

Created on 2018-08-18 by the reprex package (v0.2.0).
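The two remaining steps (setting column names and converting columns to numeric) could be sketched as follows. This is only an illustration: the data frame is rebuilt by hand from the output above so the snippet runs on its own, "ParkName" and "State" come from the as.is argument in the question, and "Year" and "Acres" are hypothetical names, since the file's real header was not shown.

```r
# Rebuild the all-character result shown above so the sketch is self-contained
df <- data.frame(
  V1 = c("Yellowstone", "Everglades", "Yosemite",
         "Great Smoky Mountains", "Wolf Trap Farm"),
  V2 = c("ID/MT/WY", "FL", "CA", "NC/TN", "VA"),
  V3 = c("1872", "1934", "1864", "1926", "1966"),
  V4 = c("4,065,493", "1,398,800", "760,917", "520,269", "130"),
  stringsAsFactors = FALSE
)

# "Year" and "Acres" are assumed names, not taken from the original file
names(df) <- c("ParkName", "State", "Year", "Acres")

# Convert the numeric-looking columns; the thousands separators
# have to be stripped before as.numeric() can parse the values
df$Year  <- as.integer(df$Year)
df$Acres <- as.numeric(gsub(",", "", df$Acres, fixed = TRUE))

str(df)
```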

The first observation has no row number
