我正在使用Stata中的字符串变量response
。此变量存储完整的句子,其中许多句子具有重复的短语。
例如:
how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that
我想通过删除所有重复的短语来清理这些字符串。
换句话说,我要转换这句话:
how do you know how do you know what it is?
到以下一个:
how do you know what it is?
到目前为止,我已经尝试过分别解决每种情况,但这非常耗时,因为有成千上万的重复单词/短语。
我想运行可识别何时在相同观察值/字符串中重复一个短语的代码,然后删除该短语(或单词)的一个实例。
我认为正则表达式会有所帮助,但我想不出更多的东西。
答案 0 :(得分:2)
以下对我有用:
clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end
clonevar wanted = string
local stop = 0
while `stop' == 0 {
generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
replace wanted = subinstr(wanted, dup, "", 1)
capture assert dup == ""
if _rc == 0 local stop = 1
else drop dup
}
replace wanted = strtrim(stritrim(wanted))
list wanted
+----------------------------------------------------------+
| wanted |
|----------------------------------------------------------|
1. | Pearly Spencer how do you know what it is? |
2. | it was during the past thirty days |
3. | well I would hope that they're doing that |
4. | well they're doing that I would hope |
5. | well I would hope that they're doing that but they don't |
+----------------------------------------------------------+
以上解决方案使用正则表达式首先识别重复的单词/短语。然后,通过在其位置替换一个空格来从字符串中消除此错误。
因为此特定的正则表达式无法一次找到所有集合(例如,在上一次观察中有三个集合-well
,I would hope
和but
),所以过程为使用while
循环重复,直到字符串中没有重复的元素。
最后一步,将删除所有不必要的空格以使字符串恢复原状。