Question

I need to read data from text files (many of them very large), which typically look like this:

#2013#3090050010#CCOU#01#022#1#N#16/03/2015
#2013#3090050010#CCOU#01#023#1##16/03/2015
#2013#3090050010#CCOU#02#005#1#1692528#16/03/2015
#2013#3090430110#CCOU#15#504#2#blablablablablablablablablablablablablab
labla#01/10/2014
#2013#3090430110#CCOU#15#505#2##01/10/2014

So "#" is the separator, and a long record sometimes spans two lines. I have a workaround in which I ignore the lines that do not start with "#", using grep:

x <- readLines("data.txt")
# keep only the lines that start with "#"
y <- grep("^#", x)
app <- x[y]
NamesForCols <- c("..",...)
myDat <- read.table(text = app, header = FALSE, sep = "#", quote = "",
                    col.names = NamesForCols,
                    colClasses = c("NULL", "factor", NA, NA, NA, NA, NA, "character", "NULL"),
                    fill = TRUE, blank.lines.skip = TRUE, comment.char = "", allowEscapes = TRUE)

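But I am not happy with this solution, because important data is lost. Is there a way to read data.txt so that every record must contain the "#" symbol exactly 8 times, even if that sometimes means reading across two lines? Any other suggestions are welcome. Thanks!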

Answer 1

You can do the following. First combine the lines that do not start with # with the preceding line (text is assumed to hold the raw file contents as a single string):

x <- strsplit(text, "\n")[[1]]
# starts with # or is empty
ind <- cumsum(pmax(grepl("^#",x), x==""))
x_collapsed <- vapply(split(x, ind), paste0, character(1), collapse = "")
x_collapsed <- paste(x_collapsed, collapse = "\n")

Now you can read it, for example via:

require(readr)
read_delim(x_collapsed, delim = "#", col_names = FALSE,
       col_types = cols(X9 = col_date("%d/%m/%Y")))[, -1]

The result is:

# A tibble: 5 × 8
     X2         X3    X4    X5    X6    X7                                             X8         X9
  <int>      <dbl> <chr> <chr> <chr> <int>                                          <chr>     <date>
1  2013 3090050010  CCOU    01   022     1                                              N 2015-03-16
2  2013 3090050010  CCOU    01   023     1                                           <NA> 2015-03-16
3  2013 3090050010  CCOU    02   005     1                                        1692528 2015-03-16
4  2013 3090430110  CCOU    15   504     2 blablablablablablablablablablablablablab labla 2014-10-01
5  2013 3090430110  CCOU    15   505     2                                           <NA> 2014-10-01
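
The grouping idea in this answer can also be tried with base R alone. The following is only a minimal sketch under assumptions: the sample vector x is a shortened, made-up version of the question's data, and record_id, records and n_hash are illustrative names that do not appear in the answer. It rebuilds each record from its physical lines and then checks the "exactly 8 separators" condition the question asks for.

```r
# Sample input (shortened, illustrative): the second record spans two lines.
x <- c(
  "#2013#3090050010#CCOU#01#022#1#N#16/03/2015",
  "#2013#3090430110#CCOU#15#504#2#blablab",
  "labla#01/10/2014",
  "#2013#3090430110#CCOU#15#505#2##01/10/2014"
)

# A new record starts at every line beginning with "#";
# cumsum() turns those starts into a record id for each physical line.
record_id <- cumsum(grepl("^#", x))

# Glue the physical lines of each record back together.
records <- unname(vapply(split(x, record_id), paste0, character(1), collapse = ""))

# Sanity check: every rebuilt record should contain exactly 8 "#" characters.
n_hash <- lengths(regmatches(records, gregexpr("#", records, fixed = TRUE)))
stopifnot(all(n_hash == 8))

# Parse the records; the first column is empty because each record starts with "#".
df <- read.table(text = records, sep = "#", quote = "", comment.char = "",
                 colClasses = "character")[, -1]
```

Reading every column as character keeps codes such as 01 and 022 intact; they can be converted afterwards where a numeric or date type is wanted.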

Answer 2

Assuming every record ends in a date, you can collapse the whole file into one string and put the line breaks back after each date:

text <- readLines("data.txt")
text_string <- paste0(text, collapse="")

# assuming every line ends in a date, put back line breaks
# by matching and capturing
result <- gsub("(\\d{2}/\\d{2}/\\d{4})\\s?", "\\1\n", text_string, perl = TRUE)

# read from string
df <- read.delim(text = result, header=FALSE, sep = "#")[2:9]

df
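
To try the date-based splitting without a file on disk, here is a small sketch under assumptions: lines_in is a made-up, shortened sample, and the quote = "" and colClasses = "character" arguments plus the as.Date() conversion are additions for illustration, not part of the answer above. The approach only works if no other field contains text shaped like dd/mm/yyyy.

```r
# Illustrative sample: the second record is split across two physical lines.
lines_in <- c(
  "#2013#3090050010#CCOU#01#023#1##16/03/2015",
  "#2013#3090430110#CCOU#15#504#2#blablab",
  "labla#01/10/2014"
)

# Remove all existing line breaks by collapsing into a single string.
text_string <- paste0(lines_in, collapse = "")

# Reinsert a line break after every date, which marks the end of a record.
result <- gsub("(\\d{2}/\\d{2}/\\d{4})\\s?", "\\1\n", text_string, perl = TRUE)

# Read as character so that codes like "01" keep their leading zeros.
df <- read.delim(text = result, header = FALSE, sep = "#",
                 quote = "", colClasses = "character")[2:9]

# Convert the last field to a Date once the records are separated.
df$V9 <- as.Date(df$V9, format = "%d/%m/%Y")
```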

Answer 3

Building on the answers from floo0 and epi99, I arrived at my own solution:

text <- readLines("data.txt")
text_string <- paste0(text, collapse="")
result <- gsub("(#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*)","\\1\n", text_string, perl = TRUE)
df <- read.delim(text = result, header=FALSE, sep = "#")[2:9]

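So it differs from epi99's answer in that it looks for a pattern in which the "#" symbol occurs the correct number of times, possibly interleaved with other characters.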
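
The eight repeated groups in that pattern can probably be written more compactly with a quantifier. The sketch below is an assumption-laden illustration, not the poster's code: short_pattern and the shortened sample string are made up here, and the comparison simply checks that both patterns split the sample in the same way.

```r
# Illustrative sample: two records already collapsed into one string.
text_string <- paste0(
  "#2013#3090050010#CCOU#01#022#1#N#16/03/2015",
  "#2013#3090430110#CCOU#15#504#2#blablab",
  "labla#01/10/2014"
)

# The pattern from the answer, and an equivalent form using {8}.
long_pattern  <- "(#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*#[^#]*)"
short_pattern <- "((?:#[^#]*){8})"

# Each match is one record: eight "#"-prefixed fields in a row.
res_long  <- gsub(long_pattern,  "\\1\n", text_string, perl = TRUE)
res_short <- gsub(short_pattern, "\\1\n", text_string, perl = TRUE)
stopifnot(identical(res_long, res_short))

# One element per record.
records <- strsplit(res_short, "\n", fixed = TRUE)[[1]]
```

Both forms rely on field values never containing "#" themselves; a stray "#" inside a text field would shift every record boundary that follows it.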

R: reading data that sometimes spans two lines
