Here is my substring:
> substring(reut2.000[4,], regexpr(">",reut2.000[3,]) + 1)
[1] "<D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>"
I want to extract all of the characters between <D> and </D>. In this case, the output would be
"el-salvador","usa","uruguay"
So far I have tried
gsub(".*<D>\\s*|</D>.*", "", tmp)
where tmp is the substring, but it returns only "uruguay". How can I modify it so that it returns all of the places?
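Answer 0: (score: 4)
You have an XML file (<< == which may well be the very file you have). Note that the link points to a sample file in the tmparallel package, and there are many places in that package that have code that works with it.
Work with XML as XML. Do not use regular expressions on it.
xdf$places in the following snippet has what you are looking for, but since this file is most likely being used in a text mining class, you may eventually need to extract all of the other bits into a data frame as well.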
library(xml2)
library(tidyverse)
# Download the sample file (the ~/Data directory must already exist)
download.file(
  "https://raw.githubusercontent.com/noahhl/tmparallel/master/pkg/inst/texts/reuters-21578.xml",
  "~/Data/reuters-21578.xml"
)
reut <- read_xml("~/Data/reuters-21578.xml")
# Build one row per <REUTERS> node and bind the rows into a single tibble
xml_find_all(reut, "//REUTERS") %>%
  map_df(~{
    # The node attributes (TOPICS, LEWISSPLIT, ...) become the first columns
    xml_attrs(.x) %>%
      as.list() %>%
      as_tibble() -> xdf
    xdf$date <- xml_find_first(.x, ".//DATE") %>% xml_text(trim=TRUE)
    #### NOTE THAT THIS FOLLOWING LINE IS THE DATA YOU ASKED FOR IN THE EXAMPLE
    xdf$places <- list(xml_find_all(.x, ".//PLACES/D") %>% xml_text(trim=TRUE))
    xdf$people <- list(xml_find_all(.x, ".//PEOPLE/D") %>% xml_text(trim=TRUE))
    xdf$orgs <- list(xml_find_all(.x, ".//ORGS/D") %>% xml_text(trim=TRUE))
    xdf$exchanges <- list(xml_find_all(.x, ".//EXCHANGES/D") %>% xml_text(trim=TRUE))
    xdf$companies <- list(xml_find_all(.x, ".//COMPANIES/D") %>% xml_text(trim=TRUE))
    xdf$uknown <- xml_find_first(.x, ".//UNKNOWN") %>% xml_text(trim=TRUE)
    xdf$text_title <- xml_find_first(.x, ".//TEXT/TITLE") %>% xml_text(trim=TRUE)
    xdf$text_dateline <- xml_find_first(.x, ".//TEXT/DATELINE") %>% xml_text(trim=TRUE)
    xdf$text_body <- xml_find_first(.x, ".//TEXT/BODY") %>% xml_text(trim=TRUE)
    xdf
  }) -> text_df
Output:
text_df
## # A tibble: 10 x 15
## TOPICS LEWISSPLIT CGISPLIT OLDID NEWID date places people orgs
## <chr> <chr> <chr> <chr> <chr> <chr> <list> <list> <lis>
## 1 YES TRAIN TRAINING… 5544 1 26-FEB-1… <chr [… <chr [… <chr…
## 2 NO TRAIN TRAINING… 5545 2 26-FEB-1… <chr [… <chr [… <chr…
## 3 NO TRAIN TRAINING… 5546 3 26-FEB-1… <chr [… <chr [… <chr…
## 4 NO TRAIN TRAINING… 5547 4 26-FEB-1… <chr [… <chr [… <chr…
## 5 YES TRAIN TRAINING… 5548 5 26-FEB-1… <chr [… <chr [… <chr…
## 6 YES TRAIN TRAINING… 5549 6 26-FEB-1… <chr [… <chr [… <chr…
## 7 NO TRAIN TRAINING… 5550 7 26-FEB-1… <chr [… <chr [… <chr…
## 8 YES TRAIN TRAINING… 5551 8 26-FEB-1… <chr [… <chr [… <chr…
## 9 YES TRAIN TRAINING… 5552 9 26-FEB-1… <chr [… <chr [… <chr…
## 10 YES TRAIN TRAINING… 5553 10 26-FEB-1… <chr [… <chr [… <chr…
## # ... with 6 more variables: exchanges <list>, companies <list>,
## # uknown <chr>, text_title <chr>, text_dateline <chr>, text_body <chr>
glimpse(text_df)
## Observations: 10
## Variables: 15
## $ TOPICS <chr> "YES", "NO", "NO", "NO", "YES", "YES", "NO", "YE...
## $ LEWISSPLIT <chr> "TRAIN", "TRAIN", "TRAIN", "TRAIN", "TRAIN", "TR...
## $ CGISPLIT <chr> "TRAINING-SET", "TRAINING-SET", "TRAINING-SET", ...
## $ OLDID <chr> "5544", "5545", "5546", "5547", "5548", "5549", ...
## $ NEWID <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"
## $ date <chr> "26-FEB-1987 15:01:01.79", "26-FEB-1987 15:02:20...
## $ places <list> [<"el-salvador", "usa", "uruguay">, "usa", "usa...
## $ people <list> [<>, <>, <>, <>, <>, <>, <>, <>, <>, <>]
## $ orgs <list> [<>, <>, <>, <>, <>, <>, <>, <>, <>, <>]
## $ exchanges <list> [<>, <>, <>, <>, <>, <>, <>, <>, <>, <>]
## $ companies <list> [<>, <>, <>, <>, <>, <>, <>, <>, <>, <>]
## $ uknown <chr> "C T\nf0704reute\nu f BC-BAHIA-COCOA-REVIEW 02...
## $ text_title <chr> "BAHIA COCOA REVIEW", "STANDARD OIL <SRD> TO FOR...
## $ text_dateline <chr> "SALVADOR, Feb 26 -", "CLEVELAND, Feb 26 -", "HO...
## $ text_body <chr> "Showers continued throughout the week in\nthe B...
str(head(text_df, 2))
## Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 15 variables:
## $ TOPICS : chr "YES" "NO"
## $ LEWISSPLIT : chr "TRAIN" "TRAIN"
## $ CGISPLIT : chr "TRAINING-SET" "TRAINING-SET"
## $ OLDID : chr "5544" "5545"
## $ NEWID : chr "1" "2"
## $ date : chr "26-FEB-1987 15:01:01.79" "26-FEB-1987 15:02:20.00"
## $ places :List of 2
## ..$ : chr "el-salvador" "usa" "uruguay"
## ..$ : chr "usa"
## $ people :List of 2
## ..$ : chr
## ..$ : chr
## $ orgs :List of 2
## ..$ : chr
## ..$ : chr
## $ exchanges :List of 2
## ..$ : chr
## ..$ : chr
## $ companies :List of 2
## ..$ : chr
## ..$ : chr
## $ uknown : chr "C T\nf0704reute\nu f BC-BAHIA-COCOA-REVIEW 02-26 0105" "F Y\nf0708reute\nd f BC-STANDARD-OIL-<SRD>-TO 02-26 0082"
## $ text_title : chr "BAHIA COCOA REVIEW" "STANDARD OIL <SRD> TO FORM FINANCIAL UNIT"
## $ text_dateline: chr "SALVADOR, Feb 26 -" "CLEVELAND, Feb 26 -"
## $ text_body : chr "Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary an"| __truncated__ "Standard Oil Co and BP North America\nInc said they plan to form a venture to manage the money market\nborrowin"| __truncated__
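If only the places from the single fragment in the question are needed, the same XPath idea can be applied to the string directly. The following is a minimal sketch, not part of the original answer: it assumes the fragment is missing only its opening <PLACES> tag, so that tag is prepended before parsing with xml2. The names frag, doc, and places are illustrative.

```r
library(xml2)

# Fragment from the question; it lacks the opening <PLACES> tag
frag <- "<D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>"

# Assumption: prepending the opening tag yields well-formed XML
doc <- read_xml(paste0("<PLACES>", frag))

# Select every <D> element and keep its text content
places <- xml_text(xml_find_all(doc, ".//D"), trim = TRUE)
print(places)
```

This keeps the parsing in the XML library, so it does not depend on how the tags happen to be laid out in the string.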
Answer 1: (score: 1)
Here is one option that uses gregexpr and regmatches to capture all of the matches in the text:
input <- c("<D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>")
m <- gregexpr("(?<=<D>).*?(?=</D>)", input, perl=TRUE)
regmatches(input, m)[[1]]
[1] "el-salvador" "usa" "uruguay"
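Note that using regular expressions to parse HTML/XML or similar content is generally not recommended. One reason is that tags may be nested, which causes a simple regex to break.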
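To make the nesting caveat concrete, here is a small sketch that applies the same pattern to a made-up input in which one <D> element contains another. The nested string is hypothetical and does not come from the Reuters file; it only illustrates how the lazy match stops at the first closing tag it finds.

```r
# Hypothetical input with a nested <D> element (not from the Reuters data)
nested <- "<D>outer<D>inner</D></D>"

# Same pattern as above: text between <D> and the nearest following </D>
m_nested <- gregexpr("(?<=<D>).*?(?=</D>)", nested, perl = TRUE)
nested_matches <- regmatches(nested, m_nested)[[1]]

# The single match still contains a stray opening tag
print(nested_matches)
```

The regex has no notion of which closing tag belongs to which opening tag, so the result mixes markup into the extracted text. An XML parser avoids this.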
Answer 2: (score: 0)
Another option using gsub:
temp <- "<D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>"
temp <- unlist(strsplit(gsub("<D>|</D>|</PLACES>", " ", x = temp ), split = " "))
temp <- temp[temp != ""]
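As a quick sanity check, the sketch below wraps the steps above in a helper and compares the result with the expected places. The function name split_places is hypothetical. The approach assumes that no place name contains a space, since a space is used as the temporary separator.

```r
# Hypothetical helper wrapping the gsub/strsplit steps shown above
split_places <- function(x) {
  # Replace every tag with a space, then split on spaces
  parts <- unlist(strsplit(gsub("<D>|</D>|</PLACES>", " ", x = x), split = " "))
  # Drop the empty strings left behind by adjacent separators
  parts[parts != ""]
}

input <- "<D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>"
result <- split_places(input)
print(result)
```

If a place name could contain a space, a separator that cannot occur in the data would be a safer choice than " ".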