Question

I want to tokenize a search query the way Google does. For instance, if I have the following search query:

the quick "brown fox" jumps over the "lazy dog"

I want a string array containing the following tokens:

the
quick
brown fox
jumps
over
the
lazy dog

As you can see, tokens inside double quotes keep their spaces.

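I'm looking for examples of how to do this in C#, preferably without regular expressions, although if a regex makes the most sense and performs best, then so be it.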

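I would also like to know how to extend this to handle other special characters, for example putting a - in front of a term to force its exclusion from the search query, and so on.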

Answer 1

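So far, this looks like a good candidate for a regex. If it gets significantly more complicated, a more elaborate tokenizing scheme may become necessary, but you should avoid that route unless you have to, as it is significantly more work. (On the other hand, for complex schemas a regex quickly turns into a dog and should likewise be avoided.)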

This regex should solve your problem:

("[^"]+"|\w+)\s*

Here is a C# example of its usage:

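using System.Text.RegularExpressions;

string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] would also include the
    // trailing whitespace swallowed by \s*.
    string token = m.Groups[1].Value.Trim('"'); // drop the surrounding quotes
}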

The real benefit of this approach is that it can easily be extended to cover your "-" requirement, like so:

string data = "the quick \"brown fox\" jumps over " +
              "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";

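MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] holds the token without the trailing whitespace; a leading
    // "-" and any surrounding quotes are still part of it.
    string token = m.Groups[1].Value;
}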

Now, I hate reading regexes as much as the next guy, but if you split it up, this one is quite easy to read:

(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*

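Explanation

If possible, match a minus sign, followed by a " and everything up to the next "
Otherwise match a " followed by everything up to the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following whitespace characters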
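
If you need the excluded terms separated from the rest, the same alternation can be written with named groups so the minus sign and the quotes never end up in the token. This is only a sketch: the class name QueryTerms, the method Split, and the group names neg, quoted and word are made up for illustration and are not part of the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryTerms
{
    // Optional minus, then either a quoted phrase or a run of word characters.
    // Only the text between the quotes is captured for phrases.
    private static readonly Regex TermPattern =
        new Regex(@"(?<neg>-)?(?:""(?<quoted>[^""]+)""|(?<word>\w+))\s*");

    // Appends each term of the query to either included or excluded.
    public static void Split(string query, List<string> included, List<string> excluded)
    {
        foreach (Match m in TermPattern.Matches(query))
        {
            string term = m.Groups["quoted"].Success
                ? m.Groups["quoted"].Value
                : m.Groups["word"].Value;

            if (m.Groups["neg"].Success)
                excluded.Add(term);   // term was prefixed with "-"
            else
                included.Add(term);
        }
    }
}
```

Passing two empty lists and the second sample query should leave the seven plain terms in included and the two "-" terms in excluded, with quotes and minus signs already stripped.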

Answer 2

Go through the string char by char, like this (sort of pseudocode):

array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word

// Rest
if not empty word:
    append word to words
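
Since the question asks for C#, here is one way the pseudocode above might be translated. Treat it as a sketch that follows the pseudocode step by step; the names QueryTokenizer and Tokenize are my own choice.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QueryTokenizer
{
    public static List<string> Tokenize(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote: the phrase is complete, spaces included.
                    words.Add(word.ToString());
                    word.Clear();
                    inQuotes = false;
                }
                else
                {
                    word.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ' ')
            {
                // A space outside quotes ends the current word, if there is one.
                if (word.Length > 0)
                {
                    words.Add(word.ToString());
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // Rest: flush whatever is left, including an unterminated quoted phrase.
        if (word.Length > 0)
        {
            words.Add(word.ToString());
        }
        return words;
    }
}
```

Like the pseudocode, this adds an empty token for an empty pair of quotes and keeps the text after an unclosed quote as one token. Handling a leading - would be one more branch in the loop.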

Answer 3

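I was trying to figure out how to do this just a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser, which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure, it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.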

To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.

Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better.
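
For reference, the TextFieldParser approach could look roughly like the sketch below. Note one deliberate change from what is described above: it hands the string to the parser through a StringReader instead of an ASCII-encoded MemoryStream, which avoids losing non-ASCII characters. The wrapper names QuotedFieldSplitter and Split are made up for this example, and the project needs access to the Microsoft.VisualBasic assembly.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedFieldSplitter
{
    public static string[] Split(string query)
    {
        // TextFieldParser accepts any TextReader, so no byte stream is needed.
        using (var parser = new TextFieldParser(new StringReader(query)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");               // split on single spaces
            parser.HasFieldsEnclosedInQuotes = true; // keep quoted phrases together

            // ReadFields returns null when there is no line to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

The parser is designed for delimited files, so it reads one line per ReadFields call and treats every single space as a field separator.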

Answer 4

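I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's solution. I thought I'd share it here even though the question asks about C#. Hope that's okay.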

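import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static final List<String> convertQueryToWords(String q) {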
    List<String> words = new ArrayList<>();
    Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
    Matcher matcher = pattern.matcher(q);
    while (matcher.find()) {
        MatchResult result = matcher.toMatchResult();
        if (result != null && result.group() != null) {
            if (result.group().contains("\"")) {
                words.add(result.group().trim().replaceAll("\"", "").trim());
            } else {
                words.add(result.group().trim());
            }
        }
    }
    return words;
}

Google-like search query tokenization & string splitting
