GTK, GLib & regular expressions: marking up matches while escaping special characters outside the matches

Time: 2018-01-16 19:11:07

Tags: c regex gtk glib gtktreeview

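I have a search function for a tree view that highlights all matches. It distinguishes between case-insensitive and case-sensitive searches as well as between regular-expression and literal searches. However, I run into a problem when the current cell contains special characters that are not part of a match. Consider the following text inside a tree view cell: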

father & mother

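Now I want to search the whole tree view for the letter 'e'. To highlight only the matches rather than the whole cell, I need to use markup. For this I use g_regex_replace_eval and its callback function in the way described in the GLib documentation. The resulting new markup text for the cell looks like this: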

fath<span background='yellow' foreground='black'>e</span>r & moth<span background='yellow' foreground='black'>e</span>r

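If there are special characters inside a match, they are escaped before being added to the hash table used by the eval function. So special characters inside matches are not a problem.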

But now I have the '&' outside the marked-up parts, and it has to be changed to &amp;. Otherwise the markup is not shown in the cell and the warning

Failed to set text from markup due to error parsing markup: Error on line x: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &amp;

is printed to the terminal.

If I use g_markup_escape_text on the new cell text, it obviously escapes not only the '&' but also the '<' and '>' of the markup tags, so this is not a solution.

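Is there a reasonable way to put markup around the matches and escape the special characters outside the markup at the same time, or at least in a few steps? Everything I have come up with so far is far too complicated, if it would work at all.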

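Edit

Although I had already considered most of what Philip suggested before asking the question, I had not touched the subject of UTF-8 yet, so he gave the important hint towards the solution. Here is the core of a working implementation: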

gchar *counter_char = original_cell_txt; // counter_char will move through all the characters of original_cell_txt.
gint counter;

gunichar unichar;
gchar utf8_char[6]; // Six bytes is the buffer size needed later by g_unichar_to_utf8 (). 
gint utf8_length;
gchar *utf8_escaped;

enum { START_POS, END_POS };
GArray *positions[2];
positions[START_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
positions[END_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
gint start_position, end_position;

txt_with_markup = g_string_new ("");    

g_regex_match (regex, original_cell_txt, 0, &match_info);

while (g_match_info_matches (match_info)) {
    g_match_info_fetch_pos (match_info, 0, &start_position, &end_position);
    g_array_append_val (positions[START_POS], start_position);
    g_array_append_val (positions[END_POS], end_position);
    g_match_info_next (match_info, NULL);
}

// A plain while loop (instead of do/while) also copes with an empty cell text.
while (*counter_char != '\0') {
    unichar = g_utf8_get_char (counter_char);
    counter = counter_char - original_cell_txt; // pointer arithmetic: byte offset of the current character

    // Check len first; g_array_index () must not be used on an empty array.
    if (positions[END_POS]->len > 0 && counter == g_array_index (positions[END_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "</span>");
        // It's simpler to always access the first element instead of looping through the whole array.
        g_array_remove_index (positions[END_POS], 0);
    }
    /*
        No "else if" is used here, since if there is a search for a single character going on and
        such a character appears double as 'm' in "command", between both m's a span tag has to be
        closed and opened at the same position.
    */
    if (positions[START_POS]->len > 0 && counter == g_array_index (positions[START_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "<span background='yellow' foreground='black'>");
        // See the comment for the similar instruction above.
        g_array_remove_index (positions[START_POS], 0);
    }

    utf8_length = g_unichar_to_utf8 (unichar, utf8_char);
    /*
        Instead of using a switch statement to check whether the current character needs to be escaped,
        for simplicity the character is sent to the escape function regardless of whether there will be
        any escaping done by it or not.
    */
    utf8_escaped = g_markup_escape_text (utf8_char, utf8_length);

    txt_with_markup = g_string_append (txt_with_markup, utf8_escaped);

    // Cleanup
    g_free (utf8_escaped);

    counter_char = g_utf8_find_next_char (counter_char, NULL);
}

/*
    There is a '</span>' to set at the end; because the end position is one position after the string size
    this couldn't be done inside the preceding loop.
*/            
if (positions[END_POS]->len) {
    g_string_append (txt_with_markup, "</span>");
}

g_object_set (txt_renderer, "markup", txt_with_markup->str, NULL);

// Cleanup
g_regex_unref (regex);
g_match_info_free (match_info);
g_array_free (positions[START_POS], TRUE);
g_array_free (positions[END_POS], TRUE);

2 answers:

Answer 0: (score: 1)

The way to do this is probably not to use g_regex_replace_eval(), but rather to use g_regex_match_all() to get the list of matches for a string. Then you need to step through the string character by character (use the g_utf8_*() functions for this, since it has to be Unicode-aware). If you hit a character that needs escaping (<, >, &, "), output the escaped entity for it. When you reach a match position, output the correct markup for it.
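The walk described in this answer could look roughly like the sketch below. To keep it self-contained it uses only the C standard library: the match positions are assumed to have been collected already (for example with g_match_info_fetch_pos, as in the question's code), and MatchRange, entity_for, append and highlight_matches are hypothetical names made up for this illustration, not GLib API. The sketch walks bytes rather than code points, which is a swapped-in simplification; it should be safe as long as the offsets fall on character boundaries, because every character that needs escaping is ASCII and ASCII bytes never occur inside a multi-byte UTF-8 sequence.

```c
#include <stdlib.h>
#include <string.h>

#define SPAN_OPEN  "<span background='yellow' foreground='black'>"
#define SPAN_CLOSE "</span>"

/* Hypothetical type: half-open byte range [start, end) of one match. */
typedef struct {
    size_t start;
    size_t end;
} MatchRange;

/* Returns the entity for a character that needs escaping, or NULL. */
static const char *entity_for(char c)
{
    switch (c) {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '"':  return "&quot;";
    case '\'': return "&apos;";
    default:   return NULL;
    }
}

/* Copies src to dst (without the terminator) and returns the new end. */
static char *append(char *dst, const char *src)
{
    size_t n = strlen(src);
    memcpy(dst, src, n);
    return dst + n;
}

/*
 * Builds a newly allocated markup string; the caller frees it.
 * Assumption: ranges are sorted, non-empty, non-overlapping and lie
 * within the text.
 */
char *highlight_matches(const char *text, const MatchRange *ranges, size_t n_ranges)
{
    size_t len = strlen(text);
    /* Worst case: every byte becomes a 6-byte entity, plus one span pair per match. */
    size_t cap = len * 6 + n_ranges * (strlen(SPAN_OPEN) + strlen(SPAN_CLOSE)) + 1;
    char *out = malloc(cap);
    char *p = out;
    size_t r = 0;
    int in_match = 0;

    if (out == NULL)
        return NULL;

    /* i runs up to len inclusive so that a match ending at the end of the text is closed. */
    for (size_t i = 0; i <= len; i++) {
        if (in_match && i == ranges[r].end) {
            p = append(p, SPAN_CLOSE);
            in_match = 0;
            r++;
        }
        /* No "else": adjacent matches close and open at the same offset. */
        if (!in_match && r < n_ranges && i < len && i == ranges[r].start) {
            p = append(p, SPAN_OPEN);
            in_match = 1;
        }
        if (i == len)
            break;

        const char *entity = entity_for(text[i]);
        if (entity != NULL)
            p = append(p, entity);
        else
            *p++ = text[i];
    }
    *p = '\0';
    return out;
}
```

Called with "father & mother" and the two ranges {4, 5} and {13, 14} for the letter 'e', this produces the two highlighted spans with the ampersand between them escaped as &amp;. In a GLib program the buffer handling would more naturally be done with a GString.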

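Answer 1: (score: 0)

I first escape the whole text with g_markup_escape_text, then escape the search text as well and use it in g_regex_replace_eval. This way the match is done against the escaped text, and the unmatched text is already escaped.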
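As a rough illustration of this escape-first idea, here is a literal-search sketch in plain C. It is not the answerer's code: strstr stands in for g_regex_replace_eval, escape_markup is a hand-written stand-in for g_markup_escape_text, and highlight_literal is a hypothetical name, all chosen so that the example compiles without GLib.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SPAN_OPEN  "<span background='yellow' foreground='black'>"
#define SPAN_CLOSE "</span>"

/* Stand-in for g_markup_escape_text: returns a newly allocated escaped copy. */
char *escape_markup(const char *text)
{
    /* The longest entity is 6 bytes, so 6 * length is always enough. */
    char *out = malloc(strlen(text) * 6 + 1);
    char *p = out;

    if (out == NULL)
        return NULL;

    for (; *text != '\0'; text++) {
        const char *entity = NULL;
        switch (*text) {
        case '&':  entity = "&amp;";  break;
        case '<':  entity = "&lt;";   break;
        case '>':  entity = "&gt;";   break;
        case '"':  entity = "&quot;"; break;
        case '\'': entity = "&apos;"; break;
        }
        if (entity != NULL)
            p += sprintf(p, "%s", entity);
        else
            *p++ = *text;
    }
    *p = '\0';
    return out;
}

/* Escapes text and needle first, then wraps every occurrence in a span. */
char *highlight_literal(const char *text, const char *needle)
{
    char *esc_text = escape_markup(text);
    char *esc_needle = escape_markup(needle);
    char *out = NULL;

    if (esc_text != NULL && esc_needle != NULL) {
        size_t nlen = strlen(esc_needle);
        size_t count = 0;

        /* First pass: count the matches to size the output buffer. */
        if (nlen > 0)
            for (const char *s = esc_text; (s = strstr(s, esc_needle)) != NULL; s += nlen)
                count++;

        out = malloc(strlen(esc_text) + count * (strlen(SPAN_OPEN) + strlen(SPAN_CLOSE)) + 1);
        if (out != NULL) {
            char *p = out;
            const char *s = esc_text;
            const char *hit;

            /* Second pass: copy the text between matches and wrap each match. */
            while (nlen > 0 && (hit = strstr(s, esc_needle)) != NULL) {
                memcpy(p, s, (size_t)(hit - s));
                p += hit - s;
                p += sprintf(p, "%s%s%s", SPAN_OPEN, esc_needle, SPAN_CLOSE);
                s = hit + nlen;
            }
            strcpy(p, s);
        }
    }
    free(esc_text);
    free(esc_needle);
    return out;
}
```

Searching "father & mother" for "&" yields "father ", then the highlighted &amp;, then " mother". One limitation worth checking before relying on this approach: a search term that also occurs inside an entity name matches there too. Searching the same text for "amp" places the span between the '&' and the ';' of &amp;, which is no longer valid markup.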