Question

s <- "height(female), weight, BMI, and BMI."

In the string above, the word BMI appears twice. I want the string to be:

"height (female), weight, and BMI."

I tried the following to break the string into its unique parts:

> unique(strsplit(s, " ")[[1]])
[1] "height"      "(female),"   "weight,"    "BMI," "and"         "BMI."

But since "BMI," and "BMI." are not the same string, unique did not get rid of either of them.
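The mismatch can be checked directly. This is a minimal base R sketch; the variable names `tokens` and `bare` are illustrative and not part of the original question.

```r
s <- "height(female), weight, BMI, and BMI."
tokens <- strsplit(s, " ")[[1]]

# "BMI," and "BMI." differ only in their trailing punctuation,
# so no token is flagged as a duplicate.
dup_raw <- duplicated(tokens)

# Stripping one trailing comma or period makes the two tokens identical.
bare <- sub("[,.]$", "", tokens)
dup_bare <- duplicated(bare)

print(dup_raw)
print(dup_bare)
```

Only the comparison on `bare` flags the second BMI, which is why any fix has to look past the punctuation attached to each word.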

Edit: How can I remove repeated phrases? (i.e. body mass index instead of BMI)

s <- "height (female), weight, weight, body mass index, body mass index." 
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2") 
> stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")
[1] "height (female), weight, body mass index, body mass index."

Answer 1

It may help to first replace the unwanted duplicates with a regex like this:

(?<=, |^)\b([()\w\s]+),\s(.*?)((?: and)?(?=\1))

Demo

Explanation

(?<=, |^)\b - leading boundary (\b alone should work too, but does not anchor correctly)
([()\w\s]+), - the block element
\s(.*?)((?: and)? - everything in between
(?=\1)) - the repeated element

Code sample:

#install.packages("stringr")
library(stringr)
s <- "height(female), weight, BMI, and BMI."
stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")

Output:

[1] "height(female), weight, and BMI."

To separate the part in parentheses, apply another replacement to the result:

stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")

Output:

[1] "height (female), weight, and BMI."

Testing and putting things together:

s <- c("height(female), weight, BMI, and BMI."
       ,"height(female), weight, whatever it is, and whatever it is."
       ,"height(female), weight, age, height(female), and BMI."
       ,"weight, weight.")
s <- stringr::str_replace(s, "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))", "\\2")
stringr::str_replace(s, "(\\w+)(\\(.*?\\))", "\\1 \\2")
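
To see what the pattern in this answer captures, the sketch below prints the full match and its groups for the single-string example. It uses base R's `regexec` and `sub` with `perl = TRUE` instead of stringr, on the assumption that PCRE and stringr's ICU engine treat this pattern the same way for this input. The names `pattern`, `m`, `cleaned` and `spaced` are illustrative.

```r
s <- "height(female), weight, BMI, and BMI."
pattern <- "(?<=, |^)\\b([()\\w\\s]+),\\s(.*?)((?: and)?(?=\\1))"

# m[1] is the whole match, m[2] the block element (group 1),
# m[3] everything in between (group 2).
m <- regmatches(s, regexec(pattern, s, perl = TRUE))[[1]]
print(m)

# Replacing the whole match with group 2 drops the first copy of the element.
cleaned <- sub(pattern, "\\2", s, perl = TRUE)

# Second step from the answer: put a space before the parenthesised part.
spaced <- sub("(\\w+)(\\(.*?\\))", "\\1 \\2", cleaned, perl = TRUE)
print(spaced)
```

Here the match is "BMI, and " with "BMI" in group 1 and "and " in group 2, so the replacement leaves a single "and BMI." at the end.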

Answer 2

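You can try this regex:

(\b\w+\b)[^\w\r\n]+(?=.*\1)

and replace each match with an empty string.

Click for Demo

Check the Ruby Code

Input

height(female), weight, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, and BMI.
height(female), weight, BMI, age, and BMI.

Output

height(female), weight, and BMI.
height(female), weight, age, and BMI.

Explanation

(\b\w+\b) - matches 1+ occurrences of a word character, surrounded by word boundaries, and captures them in group 1
[^\w\r\n]+ - matches 1+ occurrences of any character that is neither a word character nor a newline. So this will match `,`, `.` or a space.
(?=.*\1) - positive lookahead verifying that whatever group 1 matched appears again later in the string. The replacement is made only in that case.

Note: this keeps the last occurrence of a repeated word.

Alternatively, if the repeated words also contain spaces, you can use (\b[^,]+)[, ]+(?=.*\1).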
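
The answer links to a Ruby demo. In R the same two patterns can be applied with base `gsub` and `perl = TRUE`, as in the sketch below. The helper names `dedupe_words` and `dedupe_phrases` are hypothetical and do not appear in the answer.

```r
# Remove every occurrence of a word that appears again later in the string.
dedupe_words <- function(x) {
  gsub("(\\b\\w+\\b)[^\\w\\r\\n]+(?=.*\\1)", "", x, perl = TRUE)
}

# Variant for comma-separated phrases that may contain spaces.
dedupe_phrases <- function(x) {
  gsub("(\\b[^,]+)[, ]+(?=.*\\1)", "", x, perl = TRUE)
}

inputs <- c(
  "height(female), weight, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, BMI, and BMI.",
  "height(female), weight, BMI, age, and BMI."
)
word_result <- dedupe_words(inputs)
print(word_result)

phrase_result <- dedupe_phrases(
  "height (female), weight, weight, body mass index, body mass index."
)
print(phrase_result)
```

One caveat: the lookahead `(?=.*\1)` finds the captured text anywhere later in the string, including inside a longer word, so a short word that is a substring of a later word would also be removed. Tightening the lookahead to `(?=.*\b\1\b)` is one possible guard for the word variant.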

Answer 3

library(stringr)

s <- "height(female), weight, BMI, and BMI, and more even more BMI."
pieces <- unlist(str_split(s, "\\b"))
non_word <- !grepl("\\w", pieces)

# if you want to keep just the last instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = TRUE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, ,  , and  even more BMI."

# if you want to keep just the first instance of a duplicated word
non_duped <- !duplicated(pieces, fromLast = FALSE)
paste0(pieces[non_word | non_duped], collapse = "")
#> [1] "height(female), weight, BMI, and ,  more even  ."
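
Both outputs above keep stray commas and spaces, because only the word pieces are dropped while the separators stay. A different technique, not the one used in this answer, is to de-duplicate whole list items and rebuild the sentence. The sketch below assumes the input is a single sentence ending in a period, with items separated by ", " and an optional "and" before the last one. The function name `dedupe_items` is hypothetical.

```r
dedupe_items <- function(s) {
  body <- sub("\\.$", "", s)                               # drop the final period
  # Split on ", " and on ", and " so that "and" is not part of an item.
  items <- strsplit(body, ",\\s*(and\\s+)?", perl = TRUE)[[1]]
  items <- unique(trimws(items))                           # keep first occurrences
  n <- length(items)
  if (n < 3) {
    return(paste0(paste(items, collapse = " and "), "."))
  }
  # Rebuild as "a, b, and c." (this always inserts "and" before the last item).
  paste0(paste(items[-n], collapse = ", "), ", and ", items[n], ".")
}

r1 <- dedupe_items("height(female), weight, BMI, and BMI.")
r2 <- dedupe_items("height (female), weight, weight, body mass index, body mass index.")
print(r1)
print(r2)
```

Because the sentence is rebuilt from scratch, the result always uses the ", and" form before the last item, even when the input had no "and".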

R: How to remove duplicate elements in a character vector
