Question

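I am parsing an HTML file in C# and extracting its text. The file contains many tags, including select and option tags. I need a regular expression that removes the select and option tags from the HTML, because I don't want that information in the extracted text.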

Here is the HTML I want to remove from the file:

 <select name="state" onchange="setCities();" id="state">>

 <option value="CA" selected="selected">CA</option>
 <option value="WA">WA</option>
 <option value="TX">TX</option>
 <option value="NV">NV</option>
 <option value="CO">CO</option>
 <option value="MI">MI</option>
 <option value="SC">SC</option>

Answer 1

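You don't need a regex just to strip HTML tags. The following method walks the HTML string character by character and builds a new string with all tags removed. It is also faster than using a regex.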

public static string StripHTMLTags(string str)
{
    // The result can never be longer than the input, so one buffer is enough.
    char[] array = new char[str.Length];
    int arrayIndex = 0;
    bool inside = false; // true while we are between '<' and '>'

    for (int i = 0; i < str.Length; i++)
    {
        char c = str[i];
        if (c == '<')
        {
            inside = true;
            continue;
        }
        if (c == '>')
        {
            inside = false;
            continue;
        }
        if (!inside)
        {
            // Copy only characters that are outside of a tag.
            array[arrayIndex] = c;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}
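
Note that the method above removes only the tags themselves, so the text between them (CA, WA, and so on) stays in the result. If the option text should go as well, one option is to remove each whole select element with a regex first. This is only a minimal sketch, assuming every <select> has a matching </select> and no attribute value contains '>'. The names HtmlCleaner and RemoveSelectElements are hypothetical and not part of the original answer. For less regular markup, an HTML parser would likely be more robust than a regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Matches one whole <select ...>...</select> element, including its <option> children.
    // Singleline lets '.' match newlines, IgnoreCase accepts <SELECT>, and the lazy '.*?'
    // stops each match at the first closing </select>.
    // Assumption: select elements are not nested and every one is closed.
    private static readonly Regex SelectElement = new Regex(
        @"<select\b[^>]*>.*?</select\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the HTML with every select element removed.
    public static string RemoveSelectElements(string html)
    {
        return SelectElement.Replace(html, string.Empty);
    }
}
```

Calling HtmlCleaner.RemoveSelectElements(html) before StripHTMLTags(...) should keep the option labels out of the extracted text.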
