Question

I am working with a string variable response in Stata. This variable stores complete sentences, many of which contain repeated phrases.

For example:

how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that

I want to clean these strings by removing all the repeated phrases.

In other words, I want to transform this sentence:

how do you know how do you know what it is?

into the following one:

how do you know what it is?

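So far I have tried to handle each case individually, but this is very time-consuming because there are thousands of repeated words/phrases.

I would like to run code that identifies when a phrase is repeated within the same observation/string and then deletes one instance of that phrase (or word).

I think regular expressions would help, but I have not been able to come up with anything beyond that.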

Answer 1

The following works for me:

clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end   

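* work on a copy so the original variable stays untouched
clonevar wanted = string
local stop = 0

while `stop' == 0 {
    // group 2 captures text that is followed by whitespace and a copy of itself
    generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
    // remove the first occurrence of the captured text
    replace wanted = subinstr(wanted, dup, "", 1)

    // stop once no observation has a duplicate left
    capture assert dup == ""
    if _rc == 0 local stop = 1
    else drop dup
}

* collapse internal runs of blanks and trim leading/trailing blanks
replace wanted = strtrim(stritrim(wanted))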

list wanted

     +----------------------------------------------------------+
     |                                                   wanted |
     |----------------------------------------------------------|
  1. |               Pearly Spencer how do you know what it is? |
  2. |                       it was during the past thirty days |
  3. |                well I would hope that they're doing that |
  4. |                     well they're doing that I would hope |
  5. | well I would hope that they're doing that but they don't |
     +----------------------------------------------------------+

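The solution above uses a regular expression to identify a repeated word or phrase: the second capture group holds the text that is immediately followed by whitespace and a copy of itself. The first occurrence of that text is then removed from the string by replacing it with an empty string.

Because this particular regular expression cannot find every set at once (the last observation, for example, contains three sets: well, I would hope, and but), the process is repeated in a while loop until no string contains a repeated element.

As a final step, all unnecessary spaces are removed to restore normal spacing in the string.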
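
To see why several passes are needed, it can help to look at what a single pass captures. The lines below are a small sketch, not part of the original answer. They reuse the same pattern on the sample data loaded above, and pass1 is a hypothetical variable name introduced only for this illustration.

```stata
* Sketch: show what one pass of the pattern captures in each observation
* (pass1 is a hypothetical helper variable, not part of the original answer)
generate pass1 = ustrregexs(2) if ustrregexm(string, "(\W|^)(.+)\s\2")
list string pass1

* clean up the helper variable
drop pass1
```

On the sample data, the first pass should capture how do you know, during the, and well (for the last three observations). The remaining sets only surface in later passes, once the earlier duplicates have been removed.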

How to remove duplicate words or phrases within the same string
