Question

I have a CSV file with an unusual layout. The data is not one contiguous block at the top of the file. The file is structured as follows:

Comment Strings
Empty row
Comment String

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String

[Desired Data with 10 columns and an undetermined number of rows]

.... and so on and so forth.

As shown above, each data block has a random number of rows.

What is the best way to import this data into R? read.table/read.csv can only do so much.

 read.table("C:\\Users\\Riemmman\\Desktop\\Historical Data\\datafile.csv",header=F,sep=",",skip=15,blank.lines.skip=T)

Answer 1

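I just ran into a problem like this. My solution was to use awk to separate the different kinds of rows, load them into separate tables in a DBMS, and use SQL to create a flat file to load into R.

Alternatively, if you don't care about the comment strings, you can probably just strip out the desired data and load that.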
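As a rough illustration of the "strip out the desired data" idea, here is a minimal sketch done directly in R rather than awk. The sample file and the variable names are made up for the example, and it assumes that comment lines never contain a comma, which may not hold for a real file.

```r
# Hypothetical sample file standing in for the real CSV
path <- tempfile(fileext = ".csv")
writeLines(c("a comment", "", "1,2,3", "4,5,6", "",
             "another comment", "7,8,9"), path)

lines <- readLines(path)

# Assumption: a line is data if and only if it contains a comma
data_lines <- lines[grepl(",", lines, fixed = TRUE)]

# Parse the surviving lines as one headerless table
stripped <- read.csv(text = data_lines, header = FALSE)
```

If the comments can contain commas, a stricter test on the number of fields per line is needed instead.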

Answer 2

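You can use a combination of readLines and grep/grepl to help you figure out which lines to read.

Here's an example. The first part just makes up some sample data.

Create some sample data.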

x <- tempfile(pattern="myFile", fileext=".csv")

cat("junk comment strings",
    "",
    "another junk comment string",
    "This,Is,My,Data",
    "1,2,3,4",
    "5,6,7,8",
    "",
    "back to comments",
    "This,Is,My,Data",
    "12,13,14,15",
    "15,16,17,18",
    "19,20,21,22", file = x, sep = "\n")

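Step 1: Use `readLines()` to get the data into R

In this step, we will also drop the lines we are not interested in. The logic is that we only care about lines that have information in the form of (for a four-column dataset):

something comma something comma something comma something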

## Read the data into R
## Replace "con" with the actual path to your file
A <- readLines(con = x)

## Find and extract the lines where there are "data".
## My example dataset only has 4 columns.
## Modify for your actual dataset.
A <- A[grepl(paste(rep(".*", 4), collapse=","), A)]
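The pattern above also matches lines with more than four fields, because `.*` can itself swallow commas. If that matters, one possible variant is a pattern that requires an exact field count. This is only a sketch: `n_cols`, `pattern` and `sample_lines` are names invented here, and it assumes no field contains a quoted comma.

```r
n_cols <- 4  # assumption: set this to 10 for the file in the question

# Exactly n_cols comma-separated fields, none of which contains a comma
pattern <- sprintf("^[^,]*(,[^,]*){%d}$", n_cols - 1)

sample_lines <- c("junk comment strings", "", "This,Is,My,Data",
                  "1,2,3,4", "a,b", "1,2,3,4,5")

# Keep only the lines with the expected number of fields
keep <- grepl(pattern, sample_lines)
sample_lines[keep]
```

Only the header line and the `1,2,3,4` line survive here; the two-field and five-field lines are rejected.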

Step 2: Identify the ranges of the data

## Identify the header rows. -1 for use with read.csv
HeaderRows <- grep("^This,Is", A)-1

## Identify the number of rows per data group.
## Appending length(A) as an end marker gives the last group its true size.
N <- diff(c(HeaderRows, length(A))) - 1
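As a cross-check on the block sizes, the same information can be derived by labelling every line with a running count of the headers seen so far. This is a sketch on a hard-coded copy of the filtered sample lines; `is_header`, `group` and `rows_per_group` are illustrative names, not part of the answer's code.

```r
# The filtered lines from the sample data (headers and data rows only)
A <- c("This,Is,My,Data", "1,2,3,4", "5,6,7,8",
       "This,Is,My,Data", "12,13,14,15", "15,16,17,18", "19,20,21,22")

is_header <- grepl("^This,Is", A)

# Running count of headers: every line gets the index of its block
group <- cumsum(is_header)

# Lines per block, minus one for the header line itself
rows_per_group <- as.vector(table(group)) - 1
rows_per_group
# 2 3
```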

Step 3: Read the data in

Use the data range information to specify how many lines to skip before reading, and how many rows to read.

myData <- lapply(seq_along(HeaderRows), 
       function(x) read.csv(text = A, header = TRUE, 
                            nrows = N[x], skip = HeaderRows[x]))
myData
# [[1]]
#   This Is My Data
# 1    1  2  3    4
# 2    5  6  7    8
# 
# [[2]]
#   This Is My Data
# 1   12 13 14   15
# 2   15 16 17   18
# 3   19 20 21   22

If you want all of this in a single data.frame rather than a list, use:

final <- do.call(rbind, myData)
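Stacking the blocks this way loses track of which block each row came from. If that is needed, one option is to tag every row with its block index before binding. The sketch below uses a small made-up list, `blocks`, in place of the real list of data frames, and the `block` column name is an assumption.

```r
# Hypothetical stand-in for the list of per-block data frames
blocks <- list(
  data.frame(This = c(1, 5), Is = c(2, 6)),
  data.frame(This = c(12, 15, 19), Is = c(13, 16, 20))
)

# Prepend a "block" column holding each data frame's position in the list
tagged <- Map(function(df, i) cbind(block = i, df),
              blocks, seq_along(blocks))

# Stack the tagged blocks into a single data frame
combined <- do.call(rbind, tagged)
```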

Answer 3

Using the data generated by @Ananda Mahto,

file = x # change for the actual file name
alldata = readLines(file) # read all data
# count the fields in data (separated by comma)
nfields = count.fields(file=textConnection(alldata), sep=",", blank.lines.skip=FALSE) 
# assume data lines have the 'mode' of the number of fields (can be replaced by the actual number of columns)
dataFields = as.numeric(names(table(nfields))[which.max(table(nfields))]) 

alldata = alldata[nfields == dataFields] # read data lines only
header = alldata[1] # the header
alldata = c(header, alldata[alldata!=header]) # remove the extra headers
datos = read.csv(text=alldata) # read the data

  This Is My Data
1    1  2  3    4
2    5  6  7    8
3   12 13 14   15
4   15 16 17   18
5   19 20 21   22
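One thing to watch with this approach: by default `count.fields()` treats both `"` and `'` as quote characters, so a comment line containing an apostrophe may be miscounted. A possible guard, assuming the data itself has no quoted fields, is to switch quoting off with `quote = ""`. The sample lines below are made up for illustration.

```r
# Made-up lines: a comment with an apostrophe, then a header and a data row
lines <- c("it's a comment", "This,Is,My,Data", "1,2,3,4")

# quote = "" disables quote handling, so the apostrophe is an ordinary character
nfields <- count.fields(textConnection(lines), sep = ",", quote = "")

# Keep only the lines that have the expected four fields
data_lines <- lines[nfields == 4]
```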
