R: How to remove duplicate elements in a character vector

Time: 2018-05-31 02:19:27

Tags: r regex string

s <- "height(female), weight, BMI, and BMI."

In the string above, the word BMI appears twice. I would like the string to be:

"height (female), weight, and BMI."

I tried the following to break the string into unique parts:

> unique(strsplit(s, " ")[[1]])
[1] "height"      "(female),"   "weight,"    "BMI," "and"         "BMI."

But since "BMI," and "BMI." are not the same string, using unique did not get rid of either of them.
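To make the mismatch concrete, here is a minimal sketch (not part of the original question) that strips trailing commas and periods from each token before comparing them. The variable names `tokens` and `bare` are made up for this illustration.

```r
s <- "height(female), weight, BMI, and BMI."
tokens <- strsplit(s, " ")[[1]]

# "BMI," and "BMI." differ only in their trailing punctuation,
# so unique() treats them as two distinct strings.
length(unique(tokens)) == length(tokens)

# Stripping trailing commas and periods makes the repeat visible.
bare <- sub("[,.]+$", "", tokens)
duplicated(bare)
```

This only detects the repeat. Dropping the token would still leave the surrounding punctuation to repair, which is why working on the whole string with a regex is the more natural route.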

Edit: How can I remove repeated phrases? (i.e. body mass index instead of BMI)

s <- "height (female), weight, weight, body mass index, body mass index." 
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2") 
> stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")
[1] "height (female), weight, body mass index, body mass index."

3 Answers:

Answer 0: (score: 1)

It might help to first replace the unwanted repetitions using a regex like this:

(?<=,|^)([()\w\s]+),\s(.*?)((?: and)?(?=\1))

Explanation

  • (?<=, |^)\b: the front boundary. (\b alone should work too, but would not be correctly anchored.)
  • ([()\w\s]+),: the block element
  • \s(.*?)((?: and)?: everything in between
  • (?=\1)): the repeated element

Code sample:

#install.packages("stringr")
library(stringr)
s <- "height(female), weight, BMI, and BMI."
stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")

Output:

[1] "height(female), weight, and BMI."

To separate the part in parentheses from the preceding word, use another replacement:

stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")

Output:

[1] "height (female), weight, and BMI."

Testing and wrapping things up:

s <- c("height(female), weight, BMI, and BMI."
       ,"height(female), weight, whatever it is, and whatever it is."
       ,"height(female), weight, age, height(female), and BMI."
       ,"weight, weight.")
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")
stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")
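For the repeated phrases in the question's edit, a single str_replace call removes only the first repetition, which is why "body mass index" still appears twice there. One possible sketch, which is not part of the original answer, is to replace every match instead. It uses base R's gsub with perl = TRUE so that it runs without stringr (stringr::str_replace_all would be the stringr counterpart); the helper name drop_repeats is hypothetical.

```r
# Same pattern as in the answer, applied to every match instead of the first.
# perl = TRUE is required for the lookbehind and the lookahead.
pattern <- "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))"

drop_repeats <- function(x) {
  # "\\2" keeps whatever sits between an item and its immediate repeat.
  gsub(pattern, "\\2", x, perl = TRUE)
}

phrases <- "height (female), weight, weight, body mass index, body mass index."
drop_repeats(phrases)
#> [1] "height (female), weight, body mass index."

drop_repeats("height(female), weight, BMI, and BMI.")
#> [1] "height(female), weight, and BMI."
```

Because matches cannot overlap, a repeat hidden inside an already consumed span may survive one pass; rerunning the function until the string stops changing would be one way to handle that.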

Answer 1: (score: 1)

You can try this regex:

(\b\w+\b)[^\w\r\n]+(?=.*\1)

and replace each match with an empty string.
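In R, this could look like the following sketch. It assumes base R's gsub with perl = TRUE, since the lookahead is not available in the default regex engine; the function name dedupe_words is made up for the example.

```r
dedupe_words <- function(x) {
  # Remove a word plus its trailing separators whenever the same
  # characters occur again later in the string.
  gsub("(\\b\\w+\\b)[^\\w\\r\\n]+(?=.*\\1)", "", x, perl = TRUE)
}

s <- c("height(female), weight, BMI, and BMI.",
       "height(female), weight, BMI, age, and BMI.")
dedupe_words(s)
#> [1] "height(female), weight, and BMI."
#> [2] "height(female), weight, age, and BMI."
```

Note that the backreference inside the lookahead is not bounded by \b, so a word that reappears later only as part of a longer word would also be removed.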

Input

height(female), weight, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, and BMI.
height(female), weight, BMI, age, and BMI.

Output

height(female), weight, and BMI.
height(female), weight, age, and BMI.

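Explanation

  • (\b\w+\b) - matches 1+ word characters delimited by word boundaries and captures them in group 1
  • [^\w\r\n]+ - matches 1+ characters that are neither word characters nor newlines, so it matches a comma, a period, or a space
  • (?=.*\1) - a positive lookahead asserting that whatever group 1 matched appears again later in the string. The replacement happens only in that case.

Note: this keeps the last occurrence of each repeated word.

Alternatively, if the repeated terms can also contain spaces, you can use (\b[^,]+)[, ]+(?=.*\1)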
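As a rough sketch of that alternative applied to the phrase example from the question's edit (again assuming base R's gsub with perl = TRUE; dedupe_phrases is a hypothetical name):

```r
dedupe_phrases <- function(x) {
  # Group 1 may now span several words, up to the next comma.
  gsub("(\\b[^,]+)[, ]+(?=.*\\1)", "", x, perl = TRUE)
}

phrases <- "height (female), weight, weight, body mass index, body mass index."
dedupe_phrases(phrases)
#> [1] "height (female), weight, body mass index."
```

Since group 1 can backtrack to a shorter prefix that is followed by a space, a leading word that merely recurs later (for example "body" in two different phrases) may also be removed, so it is worth checking the result on real data.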

Answer 2: (score: 0)

library(stringr)

s <- "height(female), weight, BMI, and BMI, and more even more BMI."
pieces <- unlist(str_split(s, "\\b"))
non_word <- !grepl("\\w", pieces)

# if you want to keep just the last instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = TRUE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, ,  , and  even more BMI."

# if you want to keep just the first instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = FALSE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, BMI, and ,  more even  ."
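The results above still contain the separators that surrounded the dropped words. A possible cleanup step for the keep-last result, offered only as a sketch and not as part of the original answer, is to collapse the leftover commas and spaces; tidy_separators is a hypothetical helper.

```r
kept_last <- "height(female), weight, ,  , and  even more BMI."

tidy_separators <- function(x) {
  # Collapse a run of two or more commas (with optional spaces between them)
  # into a single ", ".
  x <- gsub("(,\\s*){2,}", ", ", x, perl = TRUE)
  # Squeeze repeated spaces down to one.
  gsub(" {2,}", " ", x)
}

tidy_separators(kept_last)
#> [1] "height(female), weight, and even more BMI."
```

This does not fully repair the keep-first result, where a stray "and ," and a dangling period remain, so that case would need additional rules.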