如何在R

时间:2015-10-28 05:06:19

标签: r text-parsing

需要读取txt文件 https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt

并将它们转换为数据框R,列号为:LastName,FirstName,streetno,streetname,city,state和zip ...

尝试使用sep命令将它们分开但失败了......

4 个答案:

答案 0 :(得分:4)

扩展我的评论,这是另一种方法。如果您的完整数据集有更广泛的模式需要考虑,您可能需要调整一些代码。

library(stringr) # For str_trim 

# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")

# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1",  dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1",  dat$address)

# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)

dat = dat[,c(1:2,7:8,4:6)]

dat
      LastName  FirstName streetno           streetname       city state        zip
1        Bania  Thomas M.      725    Commonwealth Ave.     Boston    MA      02215
2      Barnaby      David      373        W. Geneva St.   Wms. Bay    WI      53191
3       Bausch       Judy      373        W. Geneva St.   Wms. Bay    WI      53191
...
41      Wright       Greg      791  Holmdel-Keyport Rd.    Holmdel    NY 07733-1988
42     Zingale    Michael     5640        S. Ellis Ave.    Chicago    IL      60637

答案 1 :(得分:1)

试试这个。

x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" , 
  what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))

data<-as.data.frame(x)

答案 2 :(得分:1)

此处您的问题不是如何使用R来读取此数据,而是使用您作为输入的可变长度字段之间使用常规分隔符来充分构建数据。此外,邮政编码字段包含一些alpha&#34; O&#34;字符应为&#34; 0&#34;。

因此,这是一种使用正则表达式替换添加分隔符,然后使用read.csv()解析分隔文本的方法。请注意,根据整个文本集中的例外情况,您可能需要调整正则表达式。我已经在这里一步一步地完成了它们,以明确正在做什么,以便您可以在输入文本中找到异常时进行调整。 (例如,某些城市名​​称,例如`Wms.Bay&#34;是两个单词。)

addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt)       # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt)    # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt)       # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt)                # city, by elimination

addr <- read.csv(textConnection(addr.txt), header = FALSE,
                 col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
                 stringsAsFactors = FALSE)
head(addr)
##     LastName   FirstName streetno         streetname      city state    zip
## 1      Bania   Thomas M.      725  Commonwealth Ave.    Boston    MA  02215
## 2    Barnaby       David      373      W. Geneva St.  Wms. Bay    WI  53191
## 3     Bausch        Judy      373      W. Geneva St.  Wms. Bay    WI  53191
## 4    Bolatto     Alberto      725  Commonwealth Ave.    Boston    MA  02215
## 5  Carlstrom        John      933        E. 56th St.   Chicago    IL  60637
## 6 Chamberlin  Richard A.      111         Nowelo St.      Hilo    HI  96720

答案 3 :(得分:1)

我发现通过添加逗号所在的文件将文件修复到csv最简单,然后阅读它。

## get the page as text
txt <- RCurl::getURL(
    "https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
    ## add most comma-separators, then the last for the house number
    text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)), 
    header = FALSE,
    ## set the column names
    col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
#     LastName  FirstName streetno        streetname     city state   zip
# 1      Bania  Thomas M.      725 Commonwealth Ave.   Boston    MA O2215
# 2    Barnaby      David      373     W. Geneva St. Wms. Bay    WI 53191
# 3     Bausch       Judy      373     W. Geneva St. Wms. Bay    WI 53191
# 4    Bolatto    Alberto      725 Commonwealth Ave.   Boston    MA O2215
# 5  Carlstrom       John      933       E. 56th St.  Chicago    IL 60637
# 6 Chamberlin Richard A.      111        Nowelo St.     Hilo    HI 96720