Importing a txt file into R while skipping the first few lines

Asked: 2015-11-30 14:21:59

Tags: r read.table data-import

I downloaded the Scotland rainfall data from the Met Office.

The first few lines:

Scotland Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May, Summer=June-Aug, Autumn=Sept-Nov. (Winter: Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where values are equal, rankings are based in order of year descending.
Data are provisional from February 2015 & Winter 2015. Last updated 26/11/2015

     JAN  Year     FEB  Year     MAR  Year     APR  Year     MAY  Year     JUN  Year     JUL  Year     AUG  Year    SEP   Year     OCT  Year     NOV  Year     DEC  Year     WIN  Year     SPR  Year     SUM  Year     AUT  Year     ANN  Year
   293.8  1993   278.1  1990   238.5  1994   191.1  1947   191.4  2011   155.0  1938   185.6  1940   216.5  1985   267.6  1950   258.1  1935   262.0  2009   300.7  2013   743.6  2014   409.5  1986   455.6  1985   661.2  1981  1886.4  2011
   292.2  1928   258.8  1997   233.4  1990   149.0  1910   168.7  1986   137.9  2002   181.4  1988   211.9  1992   221.2  1981   254.0  1954   244.8  1938   268.5  1986   649.5  1995   401.3  2015   435.6  1948   633.8  1954  1828.1  1990
   275.6  2008   244.7  2002   201.3  1992   146.8  1934   155.9  1925   137.8  1948   170.1  1939   202.3  2009   193.9  1982   248.8  2014   242.2  2006   267.2  1929   645.4  2000   393.7  1994   427.8  2009   615.8  1938  1756.8  2014

I am trying to read this txt file into R and tried the following:

fileURL <- "http://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Rainfall/ranked/Scotland.txt"

if(!file.exists("scotland_rainfall.txt")){
        #this will download the file in the current working directory
        download.file(fileURL,destfile = "scotland_rainfall.txt")
        dateDownload <- Sys.Date() #30-11-2015
}

scotland_weather <- read.table("scotland_rainfall.txt",skip = 8,header = F,sep = "\t",na.strings = "") 

It reads every row into a single factor column:

> head(scotland_weather)
                                                                                                                                                                                                                                              V1
1    293.8  1993   278.1  1990   238.5  1994   191.1  1947   191.4  2011   155.0  1938   185.6  1940   216.5  1985   267.6  1950   258.1  1935   262.0  2009   300.7  2013   743.6  2014   409.5  1986   455.6  1985   661.2  1981  1886.4  2011
2    292.2  1928   258.8  1997   233.4  1990   149.0  1910   168.7  1986   137.9  2002   181.4  1988   211.9  1992   221.2  1981   254.0  1954   244.8  1938   268.5  1986   649.5  1995   401.3  2015   435.6  1948   633.8  1954  1828.1  1990
3    275.6  2008   244.7  2002   201.3  1992   146.8  1934   155.9  1925   137.8  1948   170.1  1939   202.3  2009   193.9  1982   248.8  2014   242.2  2006   267.2  1929   645.4  2000   393.7  1994   427.8  2009   615.8  1938  1756.8  2014
4    252.3  2015   227.9  1989   200.2  1967   142.1  1949   149.5  2015   137.7  1931   165.8  2010   191.4  1962   189.7  2011   247.7  1938   231.3  1917   265.4  2011   638.3  2007   393.2  1967   422.6  1956   594.5  1935  1735.8  1938
5    246.2  1974   224.9  2014   180.2  1979   133.5  1950   137.4  2003   135.0  1966   162.9  1956   190.3  2014   189.7  1927   242.3  1983   229.9  1981   264.0  2006   608.9  1990   391.7  1992   397.0  2004   590.6  1982  1720.0  2008
6    245.0  1975   195.6  1995   180.0  1989   132.9  1932   129.7  2007   131.7  2004   159.9  1985   189.1  2004   189.6  1985   240.9  2001   224.9  1951   261.0  1912   592.8  2015   389.1  1913   390.1  1938   589.2  2006  1716.5  1954

> str(scotland_weather)
'data.frame':   106 obs. of  1 variable:
 $ V1: Factor w/ 106 levels "    38.6  1963    10.3  1932    28.7  1929    14.0  1974    22.5  1984    30.1  1988    32.7  1913     5.1  1947    31.7  1972 "| __truncated__,..: 106 105 104 103 102 101 100 99 98 97 ...

I also tried header = T:

> scotland_weather <- read.table("scotland_rainfall.txt",skip = 8,header = T,sep = "\t",na.strings = "")
> head(scotland_weather)
    X293.8..1993...278.1..1990...238.5..1994...191.1..1947...191.4..2011...155.0..1938...185.6..1940...216.5..1985...267.6..1950...258.1..1935...262.0..2009...300.7..2013...743.6..2014...409.5..1986...455.6..1985...661.2..1981..1886.4..2011
1    292.2  1928   258.8  1997   233.4  1990   149.0  1910   168.7  1986   137.9  2002   181.4  1988   211.9  1992   221.2  1981   254.0  1954   244.8  1938   268.5  1986   649.5  1995   401.3  2015   435.6  1948   633.8  1954  1828.1  1990
2    275.6  2008   244.7  2002   201.3  1992   146.8  1934   155.9  1925   137.8  1948   170.1  1939   202.3  2009   193.9  1982   248.8  2014   242.2  2006   267.2  1929   645.4  2000   393.7  1994   427.8  2009   615.8  1938  1756.8  2014
3    252.3  2015   227.9  1989   200.2  1967   142.1  1949   149.5  2015   137.7  1931   165.8  2010   191.4  1962   189.7  2011   247.7  1938   231.3  1917   265.4  2011   638.3  2007   393.2  1967   422.6  1956   594.5  1935  1735.8  1938
4    246.2  1974   224.9  2014   180.2  1979   133.5  1950   137.4  2003   135.0  1966   162.9  1956   190.3  2014   189.7  1927   242.3  1983   229.9  1981   264.0  2006   608.9  1990   391.7  1992   397.0  2004   590.6  1982  1720.0  2008
5    245.0  1975   195.6  1995   180.0  1989   132.9  1932   129.7  2007   131.7  2004   159.9  1985   189.1  2004   189.6  1985   240.9  2001   224.9  1951   261.0  1912   592.8  2015   389.1  1913   390.1  1938   589.2  2006  1716.5  1954
6    241.9  2005   194.8  1998   179.6  1921   132.3  1927   129.6  1920   130.4  1980   158.0  1953   188.8  1948   187.5  1935   238.1  2008   223.2  1986   260.8  1949   580.6  1920   386.5  1947   387.5  2012   587.8  1984  1696.7  2004
> str(scotland_weather)
'data.frame':   105 obs. of  1 variable:
 $ X293.8..1993...278.1..1990...238.5..1994...191.1..1947...191.4..2011...155.0..1938...185.6..1940...216.5..1985...267.6..1950...258.1..1935...262.0..2009...300.7..2013...743.6..2014...409.5..1986...455.6..1985...661.2..1981..1886.4..2011: Factor w/ 105 levels "    38.6  1963    10.3  1932    28.7  1929    14.0  1974    22.5  1984    30.1  1988    32.7  1913     5.1  1947    31.7  1972 "| __truncated__,..: 105 104 103 102 101 100 99 98 97 96 ...

I would like to keep the same column names as in the txt file.

Any other ideas would be much appreciated.

Thanks

1 answer:

Answer 0 (score: 9)

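It looks like the file does have fixed-width fields, but the header is not aligned with the data rows, so read the header and the data separately. No packages are needed.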

hdr <- read.table(fileURL, skip = 7, nrows = 1, as.is = TRUE) # header row only, kept as character
widths <- rep(c(8, 6), times = 17) # 8, 6, 8, 6, ..., 8, 6
dd <- read.fwf(fileURL, widths, skip = 8, col.names = hdr, check.names = FALSE) # data rows, original names kept
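
The same approach can be tried offline on a small sample. The sketch below is only an illustration: `sample_lines` is a shortened, hypothetical excerpt in the same layout (two value/year pairs instead of 17, and only two preamble lines), and `tmp` is a throwaway file name. It also flattens the one-row header data frame with `unlist()` before passing it as `col.names`, which is an assumption about what reads most clearly rather than something the answer requires.

```r
# Hypothetical excerpt in the same layout as the Met Office file:
# value fields are 8 characters wide, year fields are 6 characters wide.
sample_lines <- c(
  "Scotland Rainfall (mm)",
  "",
  "     JAN  Year     FEB  Year",
  "   293.8  1993   278.1  1990",
  "   292.2  1928   258.8  1997"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# The header row is whitespace-separated, so read.table can split it.
# skip = 2 jumps over the two preamble lines of this sample.
hdr <- read.table(tmp, skip = 2, nrows = 1, as.is = TRUE)

# The data rows are fixed width; skip the preamble and the header row.
widths <- rep(c(8, 6), times = 2)
dd <- read.fwf(tmp, widths, skip = 3,
               col.names = unlist(hdr, use.names = FALSE),
               check.names = FALSE)

str(dd)
unlink(tmp)
```

With `check.names = FALSE` the repeated `Year` names are kept exactly as they appear in the file, so those columns are easier to address by position (for example `dd[[2]]`) than by name.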

Note: widths can be computed from the first data row as follows:

one.line <- readLines(fileURL, n = 9)[9] # char string with 1st line of data
widths <- diff(c(0, gregexpr("\\S(?=\\s)", paste(one.line, ""), perl = TRUE)[[1]]))
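
To see what that regular expression does, here is a minimal sketch on a hard-coded line, so no download is needed. The sample string and the intermediate name `ends` are illustrative only. The pattern finds every non-space character that is followed by whitespace, which marks the last character of each right-aligned field; the differences between those positions are the field widths. This assumes every field in the chosen row is filled and right-aligned.

```r
# Hypothetical first data row, shortened to two value/year pairs.
one.line <- "   293.8  1993   278.1  1990"

# paste(one.line, "") appends a trailing space so the last field also
# ends in "non-space followed by space" and is matched by the lookahead.
ends <- gregexpr("\\S(?=\\s)", paste(one.line, ""), perl = TRUE)[[1]]

# Field widths are the gaps between consecutive field-end positions.
widths <- diff(c(0, ends))
widths
# 8 6 8 6
```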