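I downloaded a pile of data from an awful local government website. It is a plain text file containing 77,000 records that all look exactly like the ones below. I need to get this mess into R as a data frame: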
Instrument: 201301240005447
Recorded: 01/24/2013
Consideration: $150,125.00
Document Type: MORTGAGES
Pages: 17
Grantor: BYRES, CONNIE R / BYRES, SCOTT
Grantee: MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC
Legal Description: * St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4
*
---------------------------------/---------------------------------
Instrument: 201301240005408
Recorded: 01/24/2013
Consideration: $65,124.00
Document Type: MORTGAGES
Pages: 17
Grantor: SANNE, BETTY LOU / SANNE, KENNETH D
Grantee: JPMORGAN CHASE BANK NA
Legal Description: Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54
*
---------------------------------/---------------------------------
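There are recurring labels such as "Instrument", "Grantor" and "PrpId". How on earth do I import this into R? Does it involve some kind of parsing or scraping?
Needless to say, I tried importing the file into Excel and it did not work. I figure R will do better; I just need to work out how. Thanks.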
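Answer 0 (score: 7)
I'm a beginner with R, so I'm sure people will add better approaches, but as long as every record has the same fields in the same number and order, this works: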
# Use gsubfn to get read.pattern
install.packages('gsubfn')
library(gsubfn)
# Read all data rows into 'data'
data = read.pattern('Test/test.txt', '([^:]*):(.*)', as.is=TRUE, fill=TRUE)
# Reshape the data to 8 columns
df = as.data.frame(matrix(data$V2, ncol=8, byrow=TRUE))
# Set the column names to reasonable values.
colnames(df) = data$V1[1:8]
Instrument Recorded Consideration Document Type Pages Grantor Grantee Legal Description
1 201301240005447 01/24/2013 $150,125.00 MORTGAGES 17 BYRES, CONNIE R / BYRES, SCOTT MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC * St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4
2 201301240005408 01/24/2013 $65,124.00 MORTGAGES 17 SANNE, BETTY LOU / SANNE, KENNETH D JPMORGAN CHASE BANK NA Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54
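For readers who would rather not install a package, the same idea can be sketched in base R. This is a minimal sketch, not the answer author's method: it inlines the two sample records so it runs on its own, and it keeps the answer's assumption that every record has exactly 8 fields in the same order. The names field_lines, key, value and n_fields are made up for illustration.

```r
# Two sample records in the question's layout, inlined so the sketch runs on its own.
lines <- c(
  'Instrument: 201301240005447',
  'Recorded: 01/24/2013',
  'Consideration: $150,125.00',
  'Document Type: MORTGAGES',
  'Pages: 17',
  'Grantor: BYRES, CONNIE R / BYRES, SCOTT',
  'Grantee: MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC',
  'Legal Description: * St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4',
  '*',
  '---------------------------------/---------------------------------',
  'Instrument: 201301240005408',
  'Recorded: 01/24/2013',
  'Consideration: $65,124.00',
  'Document Type: MORTGAGES',
  'Pages: 17',
  'Grantor: SANNE, BETTY LOU / SANNE, KENNETH D',
  'Grantee: JPMORGAN CHASE BANK NA',
  'Legal Description: Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54',
  '*',
  '---------------------------------/---------------------------------'
)

n_fields <- 8  # assumption: every record has exactly these 8 fields, in this order

# Keep only "name: value" lines; the "*" and divider lines contain no colon.
field_lines <- lines[grepl('^[^:]+:', lines)]

# Split each line at its FIRST colon only (Legal Description values contain more colons).
key   <- sub(':.*$', '', field_lines)
value <- trimws(sub('^[^:]+:', '', field_lines))

# Fail loudly if the fixed-layout assumption does not hold.
stopifnot(length(value) %% n_fields == 0)

df <- as.data.frame(matrix(value, ncol = n_fields, byrow = TRUE),
                    stringsAsFactors = FALSE)
names(df) <- key[seq_len(n_fields)]
str(df)
```

For the real file, lines would come from readLines() instead of the inline vector. The stopifnot() check only catches a wrong total count; a record with a missing field that is compensated elsewhere would still shift values silently.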
Answer 1 (score: 2)
rl <- readLines(textConnection('Instrument: 201301240005447
Recorded: 01/24/2013
Consideration: $150,125.00
Document Type: MORTGAGES
Pages: 17
Grantor: BYRES, CONNIE R / BYRES, SCOTT
Grantee: MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC
Legal Description: * St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4
*
---------------------------------/---------------------------------
Instrument: 201301240005408
Recorded: 01/24/2013
Consideration: $65,124.00
Document Type: MORTGAGES
Pages: 17
Grantor: SANNE, BETTY LOU / SANNE, KENNETH D
Grantee: JPMORGAN CHASE BANK NA
Legal Description: Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54
*
---------------------------------/---------------------------------'))
You can use this to pick out whatever you want: define a helper function that extracts each field (similar to a question I answered earlier today).
n <- c('Instrument', 'Recorded', 'Consideration', 'Document Type',
'Pages', 'Grantor', 'Grantee', 'Legal Description')
f <- function(what, string = rl) {
gsub(sprintf('%s\\:\\s*([^~]*)|.', what), '\\1', string, perl = TRUE)
}
## read in the lines and do some minimal processing
rl <- gsub('^\\* ', '\n', rl[grepl('^[A-Z*]', rl)])
rl <- paste0(rl, collapse = '~')
rl <- strsplit(rl, '\\n')[[1]]
data.frame(setNames(lapply(n, f), n))
# Instrument Recorded Consideration Document.Type Pages
# 1 201301240005447 01/24/2013 $150,125.00 MORTGAGES 17
# 2 201301240005408 01/24/2013 $65,124.00 MORTGAGES 17
# Grantor
# 1 BYRES, CONNIE R / BYRES, SCOTT
# 2 SANNE, BETTY LOU / SANNE, KENNETH D
# Grantee
# 1 MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC
# 2 JPMORGAN CHASE BANK NA
# Legal.Description
# 1 * St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4
# 2 Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54
Or, to extract only some of the fields:
n <- c('Instrument', 'Recorded', 'Consideration')
data.frame(setNames(lapply(n, f), n))
# Instrument Recorded Consideration
# 1 201301240005447 01/24/2013 $150,125.00
# 2 201301240005408 01/24/2013 $65,124.00
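The helper f above leans on a regex idiom that is easy to miss: the pattern is an alternation of "the field I want, with its value captured" and ".", so gsub() replaces the wanted field with its captured value and replaces every other single character with an empty backreference, i.e. deletes it. A minimal sketch of just that idiom, using a hand-built one-record string; the name extract and the sample string are illustrative, not part of the answer:

```r
# One record flattened into a single string with '~' between fields,
# the same shape the answer builds with paste0(..., collapse = '~').
record <- 'Instrument: 201301240005447~Pages: 17~Grantor: BYRES, CONNIE R / BYRES, SCOTT'

# Hypothetical helper mirroring f(): keep the captured value of `what`, drop everything else.
extract <- function(what, string) {
  # First alternative: "<what>:", optional whitespace, then the value up to the next '~'.
  # Second alternative: any single character; group 1 is unset there,
  # so the replacement '\\1' is empty and the character disappears.
  pattern <- sprintf('%s:\\s*([^~]*)|.', what)
  gsub(pattern, '\\1', string, perl = TRUE)
}

extract('Pages', record)
extract('Grantor', record)
```

Because the value is captured with [^~]*, this only works if '~' never occurs inside a field value, which is an assumption about the data worth checking before running it on all 77,000 records.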
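Answer 2 (score: 2)
I wrote a fairly general parsing function that can handle any divider-line pattern and any field/value separator pattern, both passed in as regular expressions. It can also optionally strip trailing whitespace from field values, and it passes variadic arguments through to the single data.frame() call that builds the resulting data.frame.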
sectionedFieldLinesToFrame <- function(lines,divRE,sepRE,select,rtw=T,...) {
divLineIndexes <- grep(perl=T,divRE,lines);
## remove possible leading and trailing divs, for robustness
if (length(divLineIndexes)>0L && divLineIndexes[1L]==1L) {
leadDivCount <- match(T,c(diff(divLineIndexes)!=1L,T));
lines <- lines[-seq_len(leadDivCount)];
divLineIndexes <- divLineIndexes[-seq_len(leadDivCount)]-leadDivCount;
}; ## end if
if (length(divLineIndexes)>0L && divLineIndexes[length(divLineIndexes)]==length(lines)) {
trailDivCount <- match(T,c(rev(diff(divLineIndexes)!=1L),T));
lines <- lines[-seq(to=length(lines),len=trailDivCount)];
divLineIndexes <- divLineIndexes[-seq(to=length(divLineIndexes),len=trailDivCount)];
}; ## end if
## get fields to extract
if (missing(select)) {
allFieldLineIndexes <- grep(perl=T,sepRE,lines);
fields <- unique(sub(perl=T,paste0(sepRE,'.*'),'',lines[allFieldLineIndexes]));
} else {
fields <- select;
}; ## end if
## extract each field vector and build the data.frame
do.call(data.frame,c(setNames(lapply(fields,function(field) {
fieldLineIndexes <- grep(perl=T,paste0('^\\Q',field,'\\E',sepRE),lines);
sectionIndexes <- findInterval(fieldLineIndexes,divLineIndexes); ## 0-based
values <- sub(perl=T,paste0('^.*?',sepRE),'',lines[fieldLineIndexes]);
if (rtw) values <- sub(perl=T,'\\s+$','',values);
values[match(seq(0L,length(divLineIndexes)),sectionIndexes)];
}),fields),...));
}; ## end sectionedFieldLinesToFrame()
Here's how to use it:
fileName <- 'data.txt';
divRE <- '^-+/-+$';
sepRE <- ':\\s*';
df <- sectionedFieldLinesToFrame(readLines(fileName),divRE,sepRE,stringsAsFactors=F);
str(df);
## 'data.frame': 2 obs. of 8 variables:
## $ Instrument : chr "201301240005447" "201301240005408"
## $ Recorded : chr "01/24/2013" "01/24/2013"
## $ Consideration : chr "$150,125.00" "$65,124.00"
## $ Document.Type : chr "MORTGAGES" "MORTGAGES"
## $ Pages : chr "17" "17"
## $ Grantor : chr "BYRES, CONNIE R / BYRES, SCOTT" "SANNE, BETTY LOU / SANNE, KENNETH D"
## $ Grantee : chr "MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC" "JPMORGAN CHASE BANK NA"
## $ Legal.Description: chr "* St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4" "Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54"
You can also specify the select argument to choose exactly which fields to extract:
select <- c('Instrument','Pages','Grantor');
df <- sectionedFieldLinesToFrame(readLines(fileName),divRE,sepRE,select,stringsAsFactors=F);
df;
## Instrument Pages Grantor
## 1 201301240005447 17 BYRES, CONNIE R / BYRES, SCOTT
## 2 201301240005408 17 SANNE, BETTY LOU / SANNE, KENNETH D
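I have tried to make it as robust as possible. It carefully handles possible redundant leading and trailing divider lines, and it correctly handles fields that are inconsistent between sections.
The last point is worth emphasizing. All the other solutions offered make very fragile assumptions about the input data: either that every section has exactly 8 fields, always in the same order, or that every (possibly hard-coded) field name appears in every section. If that assumption is violated, those solutions become useless. My function makes no assumptions about the number, names, or consistency of fields. It dynamically collects every field name that appears in any section and builds the correct vector for each field, producing NA elements where a field is absent from a given section.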
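The NA-filling described above comes down to two base R calls inside the function: findInterval() assigns each field line a 0-based section number by counting the divider lines before it, and match() then lays the values out with one slot per section. A stripped-down sketch for a single field, with variable names chosen here for illustration:

```r
lines <- c('A:a', 'B:b', '-', 'B:c', '-', 'A:d')

div_idx <- grep('^-$', lines)   # positions of divider lines
a_idx   <- grep('^A:', lines)   # positions of lines holding field A

# Number of dividers at or before each field line = 0-based section number.
section <- findInterval(a_idx, div_idx)

# Field values, in file order.
values <- sub('^A:', '', lines[a_idx])

# One slot per section (0 .. number of dividers); match() returns NA for
# sections that have no A line, and indexing with NA yields an NA value.
a_column <- values[match(seq(0L, length(div_idx)), section)]
a_column
```

Here the middle section has no A line, so the column comes out with an NA in the second position.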
Here are some examples:
sectionedFieldLinesToFrame(character(),'^-$',':');
## data frame with 0 columns and 0 rows
sectionedFieldLinesToFrame(rep('-',2L),'^-$',':');
## data frame with 0 columns and 0 rows
sectionedFieldLinesToFrame(c('A:a','-'),'^-$',':');
## A
## 1 a
sectionedFieldLinesToFrame(c('A:a','-','-'),'^-$',':');
## A
## 1 a
sectionedFieldLinesToFrame(c('A:a','-','B:b','-'),'^-$',':');
## A B
## 1 a <NA>
## 2 <NA> b
sectionedFieldLinesToFrame(c('A:a','B:b','-','B:c','-'),'^-$',':');
## A B
## 1 a b
## 2 <NA> c
sectionedFieldLinesToFrame(c('A:a','B:b','-','B:c','-','A:d'),'^-$',':');
## A B
## 1 a b
## 2 <NA> c
## 3 d <NA>
sectionedFieldLinesToFrame(c('-','-','A:a','B:b','-','B:c','-','A:d','C:e','-'),'^-$',':');
## A B C
## 1 a b <NA>
## 2 <NA> c <NA>
## 3 d <NA> e
sectionedFieldLinesToFrame(c('-','A:a','B:b','-','-','B:c','-','A:d','C:e','-'),'^-$',':');
## A B C
## 1 a b <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> c <NA>
## 4 d <NA> e
Answer 3 (score: 1)
A solution using only {base}, with no regular expressions. It's not very elegant:
# read file and parse out values from field names
q <- readLines('ugly.txt')
q <- lapply(X = q, FUN = strsplit, split = ': ')
q <- unlist(q)
q <- matrix(data = q, ncol = 2, byrow = T)
COLUMNS <- unique(q[,1])
q <- q[,2]
# move values to rows of a DF and set names for DF
q <- matrix(data = q, ncol = 9, byrow = T)
q <- data.frame(q)
names(q) <- COLUMNS
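# clean up data types
# dates are MM/DD/YYYY, so as.Date() needs an explicit format
q$Recorded <- as.Date(as.character(q$Recorded), format = '%m/%d/%Y')
# strip '$' and ',' (as fixed strings, not regexes) before converting amounts
q$Consideration <- as.numeric(gsub(',', '', gsub('$', '', q$Consideration, fixed = TRUE), fixed = TRUE))
q$Pages <- as.numeric(as.character(q$Pages))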
View(q)
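The question also singles out PrpId, which is not a top-level field but a tag inside the Legal Description value, so none of the approaches above return it as its own column. A minimal sketch of pulling it out afterwards, assuming the tag is always written as PrpId: followed by digits (true for the two sample records, but an assumption about the rest of the file). Unlike the answer above, this does use a regular expression; the third input is a made-up value without a PrpId, included only to show the NA case.

```r
legal <- c(
  '* St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4',
  'Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54',
  'Sub:EXAMPLE WITHOUT A PROPERTY ID'  # hypothetical record with no PrpId tag
)

# regexpr() returns -1 for elements with no match;
# regmatches() returns only the matches that were found.
m <- regexpr('PrpId:[0-9]+', legal)

# Start from all-NA and fill in only the rows that matched.
# Kept as character so long IDs are not mangled by numeric conversion.
prp_id <- rep(NA_character_, length(legal))
prp_id[m > 0] <- sub('PrpId:', '', regmatches(legal, m), fixed = TRUE)
prp_id
```

The same pattern should carry over to the other embedded tags such as City or CC, although values containing spaces (St, Sub) would need a pattern that stops at the next tag rather than at the first non-digit.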