Question

I am trying to understand how to read some of the text files from the BLS database into R.

url <- "http://download.bls.gov/pub/time.series/oe/oe.datatype"
datatype <- read.table(url)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :line 1 
did not have 6 elements

I have also tried:

datatype <- read.table(url, header = FALSE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  
:line 1 did not have 6 elements

and

datatype <- read.table(url, sep="\t")

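This last approach almost works, but when I inspect the data frame, it looks like the first column has been converted to row names and the last column is filled with NAs.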

datatype
                          datatype_code datatype_name
01                                 Employment            NA
02 Employment percent relative standard error            NA
03                           Hourly mean wage            NA
04                           Annual mean wage            NA

I also tried downloading the file and inspecting it, but I am not sure what I am looking at in Notepad++.

download.file(url, "datatype.txt")
datatype <- read.table("datatype.txt", sep='\t')

datatype
                                datatype_code datatype_name
01                                 Employment            NA
02 Employment percent relative standard error            NA
03                           Hourly mean wage            NA
04                           Annual mean wage            NA

Thanks for any tips. Just trying to learn.

Answer 1

As @zx8754 pointed out, this particular file has an extra tab character ("\t") in every line except the header line.
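One way to confirm this kind of mismatch is to count the tabs on each line. The sketch below is only an illustration: the `lines` vector is a made-up sample that assumes the extra tab sits at the end of each data row (which would be consistent with the all-NA last column in the question). For the real file you could fill `lines` with `readLines(url, n = 5)` instead.

```r
# Hypothetical sample mimicking the layout described above:
# a header with two fields, and data rows with one extra trailing tab.
lines <- c(
  "datatype_code\tdatatype_name",
  "01\tEmployment\t",
  "02\tEmployment percent relative standard error\t"
)

# Number of tab characters on each line: total length minus length without tabs.
tab_count <- nchar(lines) - nchar(gsub("\t", "", lines, fixed = TRUE))

# Does the line end with a tab?
trailing_tab <- endsWith(lines, "\t")

print(data.frame(tab_count, trailing_tab))
```

If the header shows one tab fewer than the data rows, read.table sees one more field in the data than in the header, and it then treats the first column as row names.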

You can read the file without the header:

url <- "http://download.bls.gov/pub/time.series/oe/oe.datatype"
df <- read.delim(url, skip = 1, header = FALSE)
head(df)
#   V1                                         V2 V3
# 1  1                                 Employment NA
# 2  2 Employment percent relative standard error NA
# 3  3                           Hourly mean wage NA
# 4  4                           Annual mean wage NA
# 5  5       Wage percent relative standard error NA
# 6  6                Hourly 10th percentile wage NA

You can also read the header separately from the first line:

header <- read.delim(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
names(df) <- header
head(df)
#   datatype_code                              datatype_name NA
# 1             1                                 Employment NA
# 2             2 Employment percent relative standard error NA
# 3             3                           Hourly mean wage NA
# 4             4                           Annual mean wage NA
# 5             5       Wage percent relative standard error NA
# 6             6                Hourly 10th percentile wage NA

At this point you will probably want to drop the third column:

df <- df[-3]
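Note that the code column is read as a number above, so "01" becomes 1. If the leading zeros matter, a variant along these lines may help. It is a minimal sketch run on a made-up `lines` sample (an assumption standing in for the real file, with the extra tab assumed to be trailing), and it uses `colClasses = "character"` to keep the codes as text. With the real file, `lines` could come from `readLines(url)`.

```r
# Hypothetical sample standing in for the downloaded file.
lines <- c(
  "datatype_code\tdatatype_name",
  "01\tEmployment\t",
  "02\tEmployment percent relative standard error\t",
  "03\tHourly mean wage\t"
)

# Column names come from the header line only.
col_names <- strsplit(lines[1], "\t", fixed = TRUE)[[1]]

# Skip the header and read every column as character so "01" stays "01".
df <- read.delim(text = lines, skip = 1, header = FALSE,
                 colClasses = "character")

# Keep only as many columns as there are header names,
# which drops the empty column created by the extra tab.
df <- df[seq_along(col_names)]
names(df) <- col_names

str(df)
```

Selecting columns by the length of the header avoids hard-coding the index 3, so the same steps should carry over to other files with the same quirk but a different number of columns.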

Answer 2

Here is a tidyverse option that works well. It turns out that readr::read_tsv handles this file without trouble.

library(tidyverse)
df <- read_tsv(url)
head(df)
# A tibble: 6 x 2
  datatype_code                              datatype_name
          <chr>                                      <chr>
1            01                                 Employment
2            02 Employment percent relative standard error
3            03                           Hourly mean wage
4            04                           Annual mean wage
5            05       Wage percent relative standard error
6            06                Hourly 10th percentile wage

Reading a simple text file into R - BLS data
