I have web-scraped data output in R that looks like this:
Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Some names may have no email or location. I want to convert the data above into a tabular format. The output should look like:
Name   Email           City/Town
Name1  email1@xyz.com  Location1
Name2  email2@abc.com  Location2
Name3  email3@pqr.com  Location3
Name4                  Location4
Name5  email5@abc.com
Answer 0 (score: 4)
Using:
txt <- readLines(txt)
library(data.table)
library(zoo)
dt <- data.table(txt = txt)
dt[!grepl(':', txt), name := txt
][, name := na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')]
gives:
    name          email city_town
1: Name1 email1@xyz.com Location1
2: Name2 email2@abc.com Location2
3: Name3 email3@pqr.com Location3
4: Name4                Location4
5: Name5 email5@abc.com
This also works with real names. With @UweBlock's data, you get:
                  name          email city_town
1:            John Doe email1@xyz.com Location1
2: Save the World Fund email2@abc.com Location2
3:     Best Shoes Ltd. email3@pqr.com Location3
4:              Mother                Location4
5:                Jane email5@abc.com
And with multiple keys per section (again using @UweBlock's data):
                  name                          email             city_town
1:            John Doe email1@xyz.com, email1@abc.com             Location1
2: Save the World Fund                 email2@abc.com             Location2
3:     Best Shoes Ltd.                 email3@pqr.com             Location3
4:              Mother                                Location4, everywhere
5:                Jane                 email5@abc.com
Used data:
txt <- textConnection("Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: email5@abc.com")
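The fill-down step that na.locf performs above can also be sketched in base R. This is only an illustration, not the answer's method: the helper get_key is a hypothetical name, the sample vector is a shortened copy of the data, and the sketch assumes each key appears at most once per section (a repeated key would overwrite the earlier value).

```r
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1",
         "Name4", "City/Town: Location4",
         "Name5", "Email: email5@abc.com")

# Lines without a colon are treated as section headers (names)
is_name <- !grepl(":", txt, fixed = TRUE)

# Running count of headers: every line gets the number of its section
section <- cumsum(is_name)

# Hypothetical helper: one value per section for `key`, NA where it is absent
get_key <- function(key) {
  prefix <- paste0(key, ": ")
  hit <- startsWith(txt, prefix)
  out <- rep(NA_character_, sum(is_name))
  out[section[hit]] <- substring(txt[hit], nchar(prefix) + 1)
  out
}

res <- data.frame(Name = txt[is_name],
                  Email = get_key("Email"),
                  `City/Town` = get_key("City/Town"),
                  check.names = FALSE, stringsAsFactors = FALSE)
print(res)
```

Numbering the sections with cumsum() avoids the zoo dependency, at the cost of not collecting repeated keys the way toString() does in the answer.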
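Answer 1 (score: 4)
Insert "\nName:" before each name and then read it in using read.dcf. (If the data comes from a file, replace textConnection(Lines) in the first line of code with the file name, e.g. "myfile.dat".) No packages are used.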
L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))
Using the input shown in the Note at the end, this gives:
Name Email City/Town
[1,] "Name1" "email1@xyz.com" "Location1"
[2,] "Name2" NA "Location2"
[3,] "Name3" "email3@pqr.com" NA
Note: The input used. It has been modified slightly from the question to show that the code works when Email or City/Town is missing:
Lines <- "Name1
Email: email1@xyz.com
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: email3@pqr.com"
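The result of read.dcf shown above is a character matrix rather than a data frame. If a data frame is wanted, one possible follow-up step (an addition for illustration, not part of the answer) is as.data.frame; the variable names m and df are placeholders.

```r
Lines <- "Name1
Email: email1@xyz.com
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: email3@pqr.com"

# Same steps as in the answer: mark each name line as the start of a new record
L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
m <- read.dcf(textConnection(L))

# Convert the character matrix to a data frame, keeping the strings as character
df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```

Missing fields stay NA in the data frame, just as in the matrix.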
Answer 2 (score: 3)
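The input data pose some challenges, among them that the key/value pairs are separated by ": ".
The following code relies on only two assumptions, both visible in the code: key/value lines contain the separator ": ", and section headers (the names) do not.
Multiple keys in a section, e.g. multiple rows with email addresses, are handled by specifying toString() as the aggregation function for dcast().
library(data.table)
# coerce to data.table
data.table(text = txt)[
# split key/value pairs in columns
, tstrsplit(text, ": ")][
# pick section headers and create new column
is.na(V2), Name := V1][
# fill in Name into the rows below
, Name := zoo::na.locf(Name)][
# reshape key/value pairs from long to wide format using Name as row id
!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]
This gives: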
Name City/Town Email
1: Name1 Location1 email1@xyz.com
2: Name2 Location2 email2@abc.com
3: Name3 Location3 email3@pqr.com
4: Name4 Location4 NA
5: Name5 NA email5@abc.com
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1", "Name2", "Email: email2@abc.com", "City/Town: Location2", "Name3", "Email: email3@pqr.com", "City/Town: Location3", "Name4", "City/Town: Location4", "Name5", "Email: email5@abc.com")
Or, try more "realistic" names:
txt1 <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane",
"Email: email5@abc.com")
which will result in:
                  Name City/Town          Email
1:     Best Shoes Ltd. Location3 email3@pqr.com
2:                Jane        NA email5@abc.com
3:            John Doe Location1 email1@xyz.com
4:              Mother Location4             NA
5: Save the World Fund Location2 email2@abc.com
Or, with multiple keys per section:
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund", "Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com", "City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", "Email: email5@abc.com")
gives:
                  Name             City/Town                          Email
1:     Best Shoes Ltd.             Location3                 email3@pqr.com
2:                Jane                                       email5@abc.com
3:            John Doe             Location1 email1@xyz.com, email1@abc.com
4:              Mother Location4, everywhere
5: Save the World Fund             Location2                 email2@abc.com
Answer 3 (score: 3)
Using dplyr and tidyr, tested with the data provided by @Jaap (txt) and @UweBlock (txt1):
library(dplyr)
library(tidyr)
# data_frame(txt = txt1) %>%
data_frame(txt = txt) %>%
mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1)
# # A tibble: 5 x 3
# Name Email `City/Town`
# <chr> <chr> <chr>
# 1 Name1 email1@xyz.com Location1
# 2 Name2 email2@abc.com Location2
# 3 Name3 email3@pqr.com Location3
# 4 Name4 <NA> Location4
# 5 Name5 email5@abc.com <NA>
Note: the row-number column rn is added so that spread() has a unique identifier for each row.
Answer 4 (score: 2)
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: email5@abc.com")
library(microbenchmark)
library(data.table)
library(dplyr)
library(tidyr)
microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
][is.na(V2), Name := V1
][, Name := zoo::na.locf(Name)
][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
ans.zx8754 = data_frame(txt = txt2) %>%
mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1),
ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
][, name := zoo::na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
ans.G.Grothendieck = {
L <- trimws(readLines(textConnection(txt2)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))},
times = 1000)
Unit: microseconds
expr min lq mean median uq max neval cld
ans.uwe 4243.754 4885.4765 5305.8688 5139.0580 5390.360 92604.820 1000 c
ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088 1000 d
ans.jaap 2153.521 2488.0665 2788.8250 2640.1580 2773.150 91862.177 1000 b
ans.G.Grothendieck 266.268 304.0415 332.6255 331.8375 349.797 721.261 1000 a