How to remove repeated words or phrases within the same string

Asked: 2019-08-21 18:55:38

Tags: regex string stata

I am working with a string variable, response, in Stata. This variable stores complete sentences, many of which contain repeated phrases.

For example:

how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that

I would like to clean these strings by removing all of the repeated phrases.

In other words, I want to turn this sentence:

how do you know how do you know what it is?

into this one:

how do you know what it is?

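So far I have tried handling each case individually, but this is very time-consuming because there are thousands of repeated words and phrases.

I would like to run code that recognizes when a phrase is repeated within the same observation/string and then removes one instance of that phrase (or word).

I think a regular expression would help, but I have not been able to get any further than that.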

1 Answer:

Answer 0: (score: 2)

The following works for me:

clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
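end

* work on a copy so that the original variable is preserved
clonevar wanted = string
local stop = 0

while `stop' == 0 {
    * capture the repeated word/phrase (second group), if there is one
    generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
    * remove the first occurrence of the captured text
    replace wanted = subinstr(wanted, dup, "", 1)

    * stop once no observation has a repeat left
    capture assert dup == ""
    if _rc == 0 local stop = 1
    else drop dup
}

* collapse the leftover consecutive blanks and trim both ends
replace wanted = strtrim(stritrim(wanted))

list wanted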

     +----------------------------------------------------------+
     |                                                   wanted |
     |----------------------------------------------------------|
  1. |               Pearly Spencer how do you know what it is? |
  2. |                       it was during the past thirty days |
  3. |                well I would hope that they're doing that |
  4. |                     well they're doing that I would hope |
  5. | well I would hope that they're doing that but they don't |
     +----------------------------------------------------------+

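The solution above uses a regular expression to identify a repeated word or phrase. It then removes that word or phrase from the string by replacing its first occurrence with an empty string.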
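To see what the pattern captures, the match can be tried on a single string outside the dataset. This is only a minimal sketch: the local macro s is a hypothetical name used for illustration, and it assumes that ustrregexs() still refers to the most recent ustrregexm() match when it is called in the following command. For this string, the second group should be the phrase during the.

```stata
* Sketch only: the local macro s is a hypothetical name for this illustration.
local s "it was during the during the past thirty days"

* ustrregexm() returns 1 on a match; ustrregexs(2) then returns the text
* captured by the second group, that is, the repeated word or phrase.
if ustrregexm("`s'", "(\W|^)(.+)\s\2") display ustrregexs(2)

* Removing the first occurrence of that phrase leaves two consecutive blanks
* behind, which is why the answer ends with a stritrim() call.
display subinstr("`s'", "during the", "", 1)
```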

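Because this particular regular expression cannot find all of the repeated sets at once (the last observation, for example, has three: well, I would hope, and but), the process is repeated in a while loop until no string contains a repeated element.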
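The need for several passes can be traced on one string. The sketch below is not part of the original answer: the local macros s, dup and pass are hypothetical names, the sentence is a shortened variant of the last observation (without the apostrophe, to keep the quoting simple), and it again assumes that ustrregexs() refers to the match made in the while condition. Each pass prints the single phrase it removed.

```stata
* Sketch only: s, dup and pass are hypothetical names for this illustration.
local s "well well I would hope I would hope that but but they do not"
local pass = 0

* Each evaluation of the condition finds at most one repeated phrase,
* so the loop runs once per repeated set in the string.
while ustrregexm("`s'", "(\W|^)(.+)\s\2") {
    local dup = ustrregexs(2)
    * remove the first occurrence of the captured phrase
    local s = subinstr("`s'", "`dup'", "", 1)
    local ++pass
    display "pass `pass': removed [`dup']"
}

* tidy the blanks left behind by the removals
display strtrim(stritrim("`s'"))
```

One thing to keep in mind with this approach is that subinstr() removes the first occurrence of the captured text anywhere in the string, so a captured word that also appears earlier inside a longer word would be cut there instead.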

As a final step, all unnecessary spaces are removed to tidy up the string.
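As a small illustration of that clean-up, stritrim() collapses runs of internal blanks into a single blank and strtrim() drops leading and trailing blanks. The input string here is made up for the example, and the brackets are only there to make the string boundaries visible.

```stata
* Sketch only: the input string is invented for this illustration.
* stritrim() collapses consecutive internal blanks to one blank;
* strtrim() removes leading and trailing blanks.
display "[" + strtrim(stritrim("  well   I would hope ")) + "]"
```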