Question

I am learning R and trying it out on this dataset: http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt

Unfortunately, using

ap <- read.table("http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt")

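gives wrong results. The file is a "free format input file" as described here (http://data.princeton.edu/R/readingData.html). Going by the examples on that page, my simple code should work, but it doesn't: it yields broken lines and wrong entries. What is going wrong?

Thanks.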

Answer 1

You have to use read.fwf and specify the widths, like this:

widths
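
For reference, here is a minimal sketch of that approach in base R. The three sample rows reuse airport names and figures from the output shown in Answer 2, but the layout, the widths c(21, 22, 7, 7), and the column names are assumptions made for this sketch, not values measured from airport.dat.txt. It also hints at why read.table struggles: with its default whitespace separator, multi-word airport names split into a varying number of fields.

```r
# Build three fixed-width sample lines (hypothetical layout: 21 + 22 + 7 + 7 characters)
lines <- sprintf(
  "%-21s%-22s%7d%7d",
  c("HARTSFIELD INTL", "BALTO/WASH INTL", "MIDWAY"),
  c("ATLANTA", "BALTIMORE", "CHICAGO"),
  c(285693L, 73300L, 64465L),
  c(288803L, 74048L, 66389L)
)

# Write the sample to a temporary file so the example needs no network access
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Whitespace splitting gives an uneven number of fields per line,
# because "HARTSFIELD INTL" counts as two fields and "MIDWAY" as one
n_fields <- count.fields(tmp)
print(n_fields)

# Reading by fixed widths keeps each airport name in a single column
ap <- read.fwf(
  tmp,
  widths = c(21, 22, 7, 7),                     # hypothetical widths
  col.names = c("airport", "city", "n1", "n2"), # hypothetical names
  strip.white = TRUE,
  stringsAsFactors = FALSE
)
print(ap)
```

For the real file, the widths would have to be worked out from the data itself, for example by opening it in a text editor and counting character positions.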

Answer 2

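Reading a fixed-width file is always a challenge because the user has to work out the width of each column. For this kind of task, I use functions from readr to simplify the process.

The main function for reading fixed-width files is read_fwf. There is also a function called fwf_empty that helps the user "guess" the column widths. However, it may not always identify the column widths correctly. Here is an example.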

# Load package
library(readr)

# URL of the data file
filepath <- "http://ww2.amstat.org/publications/jse/datasets/airport.dat.txt"

# Guess based on position of empty columns
col_pos <- fwf_empty(filepath)

# Read the data
dat <- read_fwf(filepath, col_positions = col_pos)

# Check the data frame
head(dat) 

# A tibble: 6 × 6
               X1                           X2     X3       X4        X5        X6
            <chr>                        <chr>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL ATLANTA               285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE              73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL BOSTON                114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE             120210 121798  7076954  36242.84  15399.46
5          MIDWAY CHICAGO                64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL CHICAGO               322430 332338 25636383 300463.80 140359.38

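fwf_empty identifies every column correctly except columns 2 and 3, which it treats as a single column. So some extra work is needed.

The output of fwf_empty is a list of 4 elements holding the identified begin and end positions, the skip value, and the column names. We have to update the begin and end positions so that columns 2 and 3 are treated as separate columns.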

# Extract the begin position
Begin <- col_pos$begin

# Extract the end position
End <- col_pos$end

# Update the position information
Begin <- c(Begin[1:2], 43, Begin[3:6])
End <- c(End[1], 42, End[2:6])

# Update col_pos
col_pos$begin <- Begin
col_pos$end <- End
col_pos$col_names <- paste0("X", 1:7)

Now we read the data again.

dat2 <- read_fwf(filepath, col_positions = col_pos)
head(dat2)

# A tibble: 6 × 7
               X1        X2     X3     X4       X5        X6        X7
            <chr>     <chr>  <int>  <int>    <int>     <dbl>     <dbl>
1 HARTSFIELD INTL   ATLANTA 285693 288803 22665665 165668.76  93039.48
2 BALTO/WASH INTL BALTIMORE  73300  74048  4420425  18041.52  19722.93
3      LOGAN INTL    BOSTON 114153 115524  9549585 127815.09  29785.72
4    DOUGLAS MUNI CHARLOTTE 120210 121798  7076954  36242.84  15399.46
5          MIDWAY   CHICAGO  64465  66389  3547040   4494.78   4485.58
6     O'HARE INTL   CHICAGO 322430 332338 25636383 300463.80 140359.38

This time read_fwf reads the file successfully.
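
Once the column boundaries are known, another option is to pass them to read_fwf directly with fwf_widths instead of patching the output of fwf_empty. The sketch below does this on a small temporary file; the sample layout, the widths c(21, 22, 7, 7), and the column names are assumptions for illustration only and are not taken from airport.dat.txt.

```r
library(readr)

# Build three fixed-width sample lines (hypothetical layout: 21 + 22 + 7 + 7 characters)
lines <- sprintf(
  "%-21s%-22s%7d%7d",
  c("HARTSFIELD INTL", "BALTO/WASH INTL", "MIDWAY"),
  c("ATLANTA", "BALTIMORE", "CHICAGO"),
  c(285693L, 73300L, 64465L),
  c(288803L, 74048L, 66389L)
)

# Write the sample to a temporary file so the example needs no network access
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Describe the columns explicitly by width (hypothetical widths and names)
spec <- fwf_widths(c(21, 22, 7, 7), col_names = c("airport", "city", "n1", "n2"))

# Read the file using the explicit column specification
dat <- read_fwf(tmp, col_positions = spec)
print(dat)
```

Giving the widths explicitly avoids depending on the guess made by fwf_empty, at the cost of having to measure the columns by hand.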
