Test whether all characters in a string are non-alphanumeric

Date: 2019-03-10 05:16:13

Tags: regex string stata

The following string may be the result of a faulty API call:

_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’

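I am not sure which rows contain non-alphanumeric characters, and my task is to identify the problematic rows.

A further complication is that some non-alphanumeric characters appear in strings that I still want to keep and search, for example: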

This sentence is fine and searchable, but a few non-alphanumeric äóî donäó»t popup

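Is it possible to test whether the entire content of a string is non-alphanumeric?

1 answer:

Answer 0 (score: 2):

You can use a regular expression to find all rows that contain only standard letters and digits, plus commas, periods, exclamation marks, question marks, and spaces: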

clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamation mark!"
end

* tag = 1 when the row consists only of A-Z, a-z, 0-9, spaces, and , . ? !
generate tag = ustrregexm(var1, "^[A-Za-z0-9 ,.?!]*$")

. list tag, separator(0)

     +-----+
     | tag |
     |-----|
  1. |   0 |
  2. |   0 |
  3. |   1 |
  4. |   1 |
  5. |   1 |
  6. |   1 |
     +-----+
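
The whitelist above also rejects ordinary punctuation it does not list, such as apostrophes and hyphens. The following is a minimal sketch of widening it, not part of the original answer: the variable name `tag_wide`, the second input row, and the choice of extra characters are all illustrative assumptions. The hyphen is escaped so that it is read as a literal character rather than as a range operator.

```stata
clear
input str168 var1
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
"A well-formed sentence that doesn't pass the narrower test"
" or perhaps an exclamation mark!"
end

* Whitelist from the answer: the apostrophe and hyphen in row 2 are not allowed
generate tag = ustrregexm(var1, "^[A-Za-z0-9 ,.?!]*$")

* Hypothetical wider whitelist: additionally allows ' and an escaped literal -
generate tag_wide = ustrregexm(var1, "^[A-Za-z0-9 ,.?!'\-]*$")

list tag tag_wide, separator(0)
```

The intent is that row 2 is tagged 0 by the narrow pattern and 1 by the wider one, while row 1 stays at 0 under both because of its non-ASCII characters. Which characters belong in the whitelist depends on what counts as legitimate text in the data.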

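Another possibility is to use a regular expression that flags any row containing no letters or digits at all. In this case, that covers both requirements: rows that are entirely garbage are tagged, while rows that mix a few stray characters with readable text are left untagged: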

clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamation mark!"
"¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ "
"¥Ï“ùü’ÄÛ hihuo"
end

* tag = 1 when the row contains no A-Z, a-z, or 0-9 character at all
generate tag = ustrregexm(var1, "^[^A-Za-z0-9]*$")

list tag, separator(0)

     +-----+
     | tag |
     |-----|
  1. |   1 |
  2. |   0 |
  3. |   0 |
  4. |   0 |
  5. |   0 |
  6. |   0 |
  7. |   1 |
  8. |   0 |
     +-----+
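
Because `ustrregexm()` reports a match anywhere in the string, the same test can also be written as the negation of a search for a single letter or digit. The sketch below is an illustration rather than part of the original answer: the name `tag2` is hypothetical, and the data are a shortened subset of the rows above. It checks that both forms agree and then drops the rows that are entirely non-alphanumeric.

```stata
clear
input str168 var1
"¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷"
"¥Ï“ùü’ÄÛ hihuo"
" or perhaps an exclamation mark!"
end

* Form used in the answer: every character is something other than A-Z, a-z, 0-9
generate tag = ustrregexm(var1, "^[^A-Za-z0-9]*$")

* Equivalent form: no A-Z, a-z, or 0-9 character occurs anywhere in the string
generate tag2 = !ustrregexm(var1, "[A-Za-z0-9]")

* Stops with an error if the two forms ever disagree
assert tag == tag2

* Keep only the rows that still contain some readable text
drop if tag
list var1
```

Note that both forms also tag an empty string as 1, since an empty string contains no letters or digits; whether that is desirable depends on how missing text should be treated.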