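I used extract_tables() to extract a table from a PDF file, but the text has been spread over multiple rows. The number of rows per record varies. I want to merge the text into a single value per record.
What I want to do is similar to this post. The difference is that I have text in multiple columns, and the number of rows an entry uses is determined by a different column each time.
Example: one entry may take up four rows because the "Name & location" column is spread over four rows (while the other columns take up only two rows for that entry; the remaining cells are filled with NA). For another entry, the text may be spread over six rows because of the length of the text in the "Expertise" column.
A new record starts whenever the "Level" column contains a value instead of NA. Edit: the "Level" values are not unique.
My data looks like this: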
    Name & location        Expertise         Type        Sector          Payment   Level
 1: Ms. Jane               Student           Higher      Government and  payment   1
 2: Doe,                   <NA>              Education   education       has been  <NA>
 3: NUS                    <NA>              institute   <NA>            received  <NA>
 4: Andrew Saunders Phd.,  Chief             Municipal   Government and  payment   5
 5: Municipality of        Education         government  education       has not   <NA>
 6: Amsterdam              Officer           <NA>        <NA>            been      <NA>
 7: <NA>                   <NA>              <NA>        <NA>            received  <NA>
 8: Mr. Stephen            Spokesperson for  Municipal   Government and  payment   3
 9: Johnson,               Sustainability,   government  education       has not   <NA>
10: Orange County          Health &          <NA>        <NA>            been      <NA>
11: <NA>                   Wellbeing and     <NA>        <NA>            received  <NA>
12: <NA>                   Wellfare          <NA>        <NA>            <NA>      <NA>
13: Mrs. Susan             Junior            national    Government and  payment   4
14: Andrews,               Research          government  education       has not   <NA>
15: Police                 Manager           <NA>        <NA>            been      <NA>
16: <NA>                   Money             <NA>        <NA>            received  <NA>
17: <NA>                   Laundering        <NA>        <NA>            <NA>      <NA>
Reproducible example:
structure(list(`Name & location` = c("1: Ms. Jane", "2: Doe,",
"3: NUS", "4: Andrew Saunders Phd.,", "5: Municipality of",
"6: Amsterdam", "7: <NA>", "8: Mr. Stephen", "9: Johnson,",
"10: Orange County", "11: <NA>", "12: <NA>", "13: Mrs. Susan",
"14: Andrews,", "15: Police", "16: <NA>", "17: <NA>"),
Expertise = c("Student", NA, NA, "Chief", "Education", "Officer",
NA, "Spokesperson for", "Sustainability,", "Health &", "Wellbeing and",
"Wellfare", "Junior", "Research", "Manager", "Money", "Laundering"
), Type = c("Higher", "Education", "Insititute", "Municipal",
"Government", NA, NA, "Municipal", "Government", NA, NA,
NA, "National", "Government", NA, NA, NA), Sector = c("Government and",
"education", NA, "Government and", "education", NA, NA, "Government and",
"education", NA, NA, NA, "Government and", "education", NA,
NA, NA), Payment = c("payment", "has been", "received", "Payment",
"has not", "been", "received", "Payment", "has not", "been",
"received", NA, "Payment", "has not", "been", "received",
NA), Level = c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA,
4, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))
What I have tried so far is different versions of the code below:
DF_clean <- DF %>% mutate(Level = ifelse(grepl(NA, Level))) %>%
group_by(id = cumsum(!is.na(Level))) %>%
mutate(Level = first(Level)) %>%
group_by(Level) %>%
summarise(Name = paste(Name, collapse = " "),
Expertise = paste(Expertise, collapse = " "),
Type = paste(Type, collapse = " "),
Sector = paste(Sector, collapse = " "),
Level = paste(Level, collapse = " "))
But this seems to collapse all the text into a single record.
Any ideas on how to solve this?
Answer 0 (score: 3)
There is surely a prettier solution, but this seems to work. It also works if Level contains duplicate values.
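library(dplyr)

# Remove row numbers and <NA> from Name & location
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`))

# Compute the row ranges to merge: a record starts wherever Level is not NA
starts <- c(which(!is.na(df$Level)), nrow(df) + 1)
# lapply() always returns a list; sapply() would return a matrix
# if every record happened to span the same number of rows
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)

# Merge lines based on ranges: paste each column, then drop trailing spaces and pasted-in NAs
combined_df <- lapply(
  ranges,
  function(rows)
    lapply(df[rows, ], function(col) gsub(" +$| NA", "", paste0(col, collapse = " ")))
) %>%
  bind_rows()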
# A tibble: 4 x 6
`Name & location` Expertise Type Sector Payment Level
<chr> <chr> <chr> <chr> <chr> <chr>
1 Ms. Jane Doe, NUS Student Higher Education Insititute Government and education payment has been received 1
2 Andrew Saunders Phd., Municipality of Amsterdam Chief Education Officer Municipal Government Government and education Payment has not been received 5
3 Mr. Stephen Johnson, Orange County Spokesperson for Sustainability, Health & Wellbeing and Wellfare Municipal Government Government and education Payment has not been received 3
4 Mrs. Susan Andrews, Police Junior Research Manager Money Laundering National Government Government and education Payment has not been received 4
Edit:
I used @Andrew's solution to compute a new unique_level column and got it working. IMHO it is prettier than my first solution:
library(tidyverse)
df <- df %>%
mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
mutate(`Name & location` = gsub("<NA>", "", `Name & location`)) %>%
mutate(unique_level = ifelse(!is.na(Level), 1, NA) * 1:nrow(df)) %>%
fill(unique_level, .direction = "down") %>%
group_by(unique_level) %>%
summarise_all(~ gsub(" +$| NA", "", paste(., collapse = " "))) %>%
select(-unique_level)
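The first two mutate calls remove the row numbers and the <NA> strings from the Name & location column. The gsub call inside summarise_all removes the trailing spaces and the NA strings that are added when the rows are pasted together.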
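To make the cleanup step concrete, here is a minimal base R sketch of what happens to one column of one record. The vector type_rows is a hypothetical stand-in for the four extracted rows of a single entry; only the regular expression is taken from the answer.

```r
# One record's "Type" values as extracted: two text rows followed by two NA rows
# (hypothetical stand-in for a single entry of the table)
type_rows <- c("Municipal", "Government", NA, NA)

# paste() turns each NA into the literal string "NA"
pasted <- paste(type_rows, collapse = " ")

# " +$" strips trailing spaces; " NA" strips every NA that follows a space
cleaned <- gsub(" +$| NA", "", pasted)

print(pasted)
print(cleaned)
```

Note that the pattern only removes an NA that is preceded by a space, so a column whose first row in a record is NA would keep a leading "NA", and any genuine text containing " NA" would be truncated as well.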
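Answer 1 (score: 2)
Edited:
Here you go; this should clean it up, and it also handles non-unique levels. You will also need data.table installed, because I used rleid to create a new level variable (assuming it is fine to overwrite it and lose the actual level values). If you need to keep the original levels, just create a new rleid level column and group by that instead. Let me know if you have any questions!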
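library(dplyr)
library(tidyr)

df1 %>%
  # Carry each record's Level down to its continuation rows
  fill(Level, .direction = "down") %>%
  # Strip the row numbers and any literal <NA> from the name column
  mutate(`Name & location` = gsub("[0-9]+:\\s+(<NA>)*", "", `Name & location`)) %>%
  replace(is.na(.), "") %>%
  # Group by runs of equal Level values (this overwrites the original Level)
  group_by(Level = data.table::rleid(Level)) %>%
  summarise_all(~ trimws(paste(., collapse = " ")))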
Level `Name & location` Expertise Type Sector Payment
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 Ms. Jane Doe, NUS Student Higher Education~ Government and ~ payment has been r~
2 2 Andrew Saunders Phd., Municipalit~ Chief Education Officer Municipal Govern~ Government and ~ Payment has not be~
3 3 Mr. Stephen Johnson, Orange County Spokesperson for Sustainability, Health ~ Municipal Govern~ Government and ~ Payment has not be~
4 4 Mrs. Susan Andrews, Police Junior Research Manager Money Laundering National Governm~ Government and ~ Payment has not be~
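If the original Level values need to be kept, one possible sketch is to group by a separate record id instead of overwriting Level. Note the substitution: rather than rleid, this version uses cumsum(!is.na(Level)) computed before fill, which also keeps two adjacent records with the same Level apart and avoids the data.table dependency. The miniature df1, the column name record_id, and the use of across() (dplyr 1.0 or later) are assumptions for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the extracted table; both records share Level 1
df1 <- tibble(
  `Name & location` = c("Ms. Jane", "Doe,", "Mr. Stephen", "Johnson,", NA),
  Type    = c("Higher", "Education", "Municipal", "Government", NA),
  Payment = c("payment", "received", "Payment", "not", "received"),
  Level   = c(1, NA, 1, NA, NA)
)

result <- df1 %>%
  # A record starts at every non-NA Level, so count those before filling
  mutate(record_id = cumsum(!is.na(Level))) %>%
  fill(Level, .direction = "down") %>%
  # Blank out missing text so paste() does not produce literal "NA" strings
  mutate(across(where(is.character), ~ replace_na(.x, ""))) %>%
  # Level is constant within a record, so grouping by it keeps its value and type
  group_by(record_id, Level) %>%
  summarise(across(everything(), ~ trimws(paste(.x, collapse = " "))),
            .groups = "drop") %>%
  select(-record_id)

print(result)
```

Because Level is a grouping column here, it stays numeric and is never pasted together, and the helper column is dropped at the end.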