How do I write the correct regular expression to extract this text?

Asked: 2012-09-29 17:46:14

Tags: c# regex

I get the following HTML response from a service:
<style> .transcription, .trsc{line-height:19px; padding-left:20px; font-family:Lucida Sans Unicode; padding-right:5px;} </style><div id="shView"> <div class="cforms_result" id="cforms_result1"> <div class="ref_cform" onclick="javascript:GetFullWordCBK('1', 'wordER');"><span class="fsform_link"><a href="javascript:;" onclick="javascript:GetFullWordCBK('1', 'wordER');"><img src="/images/common/owl_ico16.gif" width="19" height="19" border="0"></a><a href="javascript:;" onclick="javascript:GetFullWordCBK('1', 'wordER');"> Спряжение </a></span><span class="ref_source">mother<wrs><span class="sforms_src"><span class="w_des">Infinitive</span><b>mother</b><br><span class="w_des">Past Indefinite</span><b>mothered</b><br><span class="w_des">Participle II</span><b>mothered</b><br><span class="w_des">Participle I</span><b>mothering</b></span></wrs></span>&nbsp;<span class="ref_info"></span>, <span class="ref_psp">Глагол</span></div> <div class="tr_pr"><span class="transcription">[ˈmʌðə]</span><span class="pronunciation"><a href="javascript:;" class="pbf_s" id="lnkGtTr1" onclick="javascript:ListenWord(this,'mother',1,'play');"><img src="/images/common/vol_on.gif" align="absmiddle" border="0" id="imgGtTr1"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></a><span class="loadFrv" id="loadFrv1"><img hspace="10" src="/images/common/al_fullWR.gif" align="absmiddle"></span><span style="width:20px; height:17px;" class="pbf_s" id="speaker_on1"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></span></span></div> <div id="translations" onclick="javascript:GetFullWordCBK('1', 'wordER');"> <ol> <li><span class="ref_result">относиться по-матерински<wrs><span 
class="sforms_src"></span></wrs></span> <span class="ref_info"></span></li> </ol> </div> </div><script> $('.sforms_src').filter(function(index) { return $(this).html().length == 0;}).remove();//getPrLink('mother ');//$('#speaker_on').unbind('click','ShowFullWRefERRE')//$('#speaker_on').click(function(){alert("не открывать окно расширеной справки");}); </script><div class="cforms_result" id="cforms_result2"> <div class="ref_cform" onclick="javascript:GetFullWordCBK('2', 'wordER');"><span class="fsform_link"><a href="javascript:;" onclick="javascript:GetFullWordCBK('2', 'wordER');"><img src="/images/common/owl_ico16.gif" width="19" height="19" border="0"></a><a href="javascript:;" onclick="javascript:GetFullWordCBK('2', 'wordER');"> Склонение </a></span><span class="ref_source">mother<wrs><span class="sforms_src"><span class="w_des">Singular</span><b>mother</b><br><span class="w_des">Plural</span><b>mothers</b></span></wrs></span>&nbsp;<span class="ref_info"></span>, <span class="ref_psp">Существительное</span></div> <div class="tr_pr"><span class="transcription">[ˈmʌðə]</span><span class="pronunciation"><a href="javascript:;" class="pbf_s" id="lnkGtTr2" onclick="javascript:ListenWord(this,'mother',2,'play');"><img src="/images/common/vol_on.gif" align="absmiddle" border="0" id="imgGtTr2"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></a><span class="loadFrv" id="loadFrv2"><img hspace="10" src="/images/common/al_fullWR.gif" align="absmiddle"></span><span style="width:20px; height:17px;" class="pbf_s" id="speaker_on2"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></span></span></div> <div id="translations" onclick="javascript:GetFullWordCBK('2', 
'wordER');"> <ol> <li><span class="ref_result">мать<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">f</span></li> <li><span class="ref_result">родительский элемент<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">m</span><span class="ref_dictionary"> (ИТ - базовый) </span></li> <li><span class="ref_result">родительский<wrs><span class="sforms_src"></span></wrs></span><span class="ref_comment"> (attributive) </span> <span class="ref_info"></span><span class="ref_dictionary"> (ИТ - базовый) </span></li> <li><span class="ref_result">прототип<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">m</span><span class="ref_dictionary"> (Политехнический) </span></li> <li><span class="ref_result">начало<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info">n</span><span class="ref_dictionary"> (Политехнический) </span></li> </ol> </div> </div><script> $('.sforms_src').filter(function(index) { return $(this).html().length == 0;}).remove();//getPrLink('mother ');//$('#speaker_on').unbind('click','ShowFullWRefERRE')//$('#speaker_on').click(function(){alert("не открывать окно расширеной справки");}); </script><div class="cforms_result" id="cforms_result3"> <div class="ref_cform" onclick="javascript:GetFullWordCBK('3', 'wordER');"><span class="fsform_link"><a href="javascript:;" onclick="javascript:GetFullWordCBK('3', 'wordER');"><img src="/images/common/owl_ico16.gif" width="19" height="19" border="0"></a><a href="javascript:;" onclick="javascript:GetFullWordCBK('3', 'wordER');"> Склонение </a></span><span class="ref_source">mother<wrs><span class="sforms_src"><span class="w_des">Positive</span><b>mother</b><br></span></wrs></span>&nbsp;<span class="ref_info"></span>, <span class="ref_psp">Прилагательное</span></div> <div class="tr_pr"><span class="transcription">[ˈmʌðə]</span><span class="pronunciation"><a href="javascript:;" class="pbf_s" id="lnkGtTr3" 
onclick="javascript:ListenWord(this,'mother',3,'play');"><img src="/images/common/vol_on.gif" align="absmiddle" border="0" id="imgGtTr3"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></a><span class="loadFrv" id="loadFrv3"><img hspace="10" src="/images/common/al_fullWR.gif" align="absmiddle"></span><span style="width:20px; height:17px;" class="pbf_s" id="speaker_on3"><span> powered by <img src="/images/common/logoforvo.gif" width="59" height="17" border="0" hspace="5" align="absmiddle" style="cursor:point; cursor:hand;" onclick="window.open('http://ru.forvo.com/');"></span></span></span></div> <div id="translations" onclick="javascript:GetFullWordCBK('3', 'wordER');"> <ol> <li><span class="ref_result">родительский<wrs><span class="sforms_src"></span></wrs></span> <span class="ref_info"></span><span class="ref_dictionary"> (ИТ - базовый) </span></li> </ol> </div> </div><script> $('.sforms_src').filter(function(index) { return $(this).html().length == 0;}).remove();//getPrLink('mother ');//$('#speaker_on').unbind('click','ShowFullWRefERRE')//$('#speaker_on').click(function(){alert("не открывать окно расширеной справки");}); </script><div id="fullRLink"><a href="javascript:GetFullWordCBK('1', 'wordER');">Показать полную словарную статью</a><span id="al_fullWR"><img src="/images/common/al_fullWR.gif" align="middle" hspace="10"> Загружаем...</span></div></div>

I want to get the text that sits between the markers in this pattern: <span class="ref_result">TEXT<wrs>

I use this code to collect all the matches:

const string pattern = "ref_result\">\\w+<";
Regex rgx = new Regex(pattern, RegexOptions.Compiled);
var text = SantinizeOutput(result.result);
MatchCollection matches = rgx.Matches(text);
if(matches.Count > 0)
{
  resultsList = new List<string>(matches.Count);
  foreach(Match match in rgx.Matches(text))
  {
    string formattedWord = match.Value;
    int leftAngleBracketIndex = formattedWord.IndexOf(">");
    var word = formattedWord.Remove(0, leftAngleBracketIndex + 1);
    word = word.TrimEnd('<');
    resultsList.Add(word);
  }
}


private string SantinizeOutput(string input)
{
  var text = input.Replace("\n", "").Replace("\r", "");
  return Regex.Replace(text, "\\s+", " ");
}

The text above contains 7 matches, but I only get 5 results.

What am I doing wrong?

3 Answers:

Answer 0: (score: 3)

\w means "word character"; it does not match whitespace (or a hyphen). Notice that two of the ref_result tags contain spaces:

<span class="ref_result">относиться по-матерински<wrs>
<span class="ref_result">родительский элемент<wrs>

Simply use "ref_result\">[^<]+<wrs" to capture everything up to the next tag.
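To make the difference concrete, here is a minimal sketch that counts matches for both patterns. The `Sample` string is a shortened, hypothetical stand-in for the real response, and `WordCharDemo` and `CountMatches` are names invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // Hypothetical, shortened sample in the same shape as the service response.
    public const string Sample =
        "<span class=\"ref_result\">мать<wrs></wrs></span>" +
        "<span class=\"ref_result\">родительский элемент<wrs></wrs></span>" +
        "<span class=\"ref_result\">относиться по-матерински<wrs></wrs></span>";

    // Counts how many times the given pattern matches the sample.
    public static int CountMatches(string pattern)
    {
        return Regex.Matches(Sample, pattern).Count;
    }

    public static void Main()
    {
        // \w+ stops at the first space or hyphen, so only the single-word entry matches.
        Console.WriteLine(CountMatches("ref_result\">\\w+<"));

        // [^<]+ accepts any character except '<', so all three entries match.
        Console.WriteLine(CountMatches("ref_result\">[^<]+<wrs"));
    }
}
```

With this sample the first pattern finds 1 match and the second finds 3, which mirrors the 5-versus-7 discrepancy in the question.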

Answer 1: (score: 2)

Try changing \w+ to .*?

So:

const string pattern = "ref_result\">.*?<";

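.*? matches any characters (non-greedily) until it hits the first < character.

.* matches any characters (greedily) until it hits the last < character. You will want the non-greedy version.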
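The greedy versus non-greedy behaviour can be checked with a small sketch like the one below. The `Sample` string is a hypothetical one-entry fragment, and `GreedyDemo` and `FirstMatch` are illustrative names, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical fragment containing a single ref_result entry followed by more markup.
    public const string Sample =
        "<span class=\"ref_result\">мать<wrs></wrs></span> <span class=\"ref_info\">f</span>";

    // Returns the text of the first match of the pattern in the sample.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Sample, pattern).Value;
    }

    public static void Main()
    {
        // Lazy: stops at the first '<' after the word.
        Console.WriteLine(FirstMatch("ref_result\">.*?<"));

        // Greedy: runs on to the last '<' in the input, swallowing the tags in between.
        Console.WriteLine(FirstMatch("ref_result\">.*<"));
    }
}
```

Note that by default . does not match a newline in .NET; the question's code strips line breaks first, so that does not matter here.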

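Answer 2: (score: 0)

By changing the regex to use a capturing group, you can also remove some of the string-trimming logic from your code.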

const string pattern = "ref_result\">([^<]*)";
Regex rgx = new Regex(pattern, RegexOptions.Compiled);
var text = SantinizeOutput(result.result);
MatchCollection matches = rgx.Matches(text);

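List<string> resultsList = new List<string>(matches.Count);
// Iterate over the matches (the list starts out empty, and List<T> has Count, not Length).
for(int i=0; i<matches.Count; i++) {
  // Groups[1] holds only the text captured by ([^<]*), without the surrounding markup.
  resultsList.Add(matches[i].Groups[1].Value);
}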

private string SantinizeOutput(string input) {
  var text = input.Replace("\n", "").Replace("\r", "");
  return Regex.Replace(text, "\\s+", " ");
}
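As a usage illustration, the same capturing-group idea can be wrapped in a small helper. This is only a sketch under assumptions: `RefResultExtractor` and `Extract` are hypothetical names, the input is assumed to be already sanitized, and LINQ is used in place of the explicit loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RefResultExtractor
{
    // Group 1 captures everything after ref_result"> up to (not including) the next '<'.
    private static readonly Regex RefResult =
        new Regex("ref_result\">([^<]*)", RegexOptions.Compiled);

    // Returns the captured text of every ref_result entry in the given HTML.
    public static List<string> Extract(string html)
    {
        return RefResult.Matches(html)
            .Cast<Match>()                       // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical two-entry sample in the shape of the service response.
        string html =
            "<li><span class=\"ref_result\">мать<wrs></wrs></span></li>" +
            "<li><span class=\"ref_result\">родительский элемент<wrs></wrs></span></li>";

        foreach (string word in Extract(html))
        {
            Console.WriteLine(word);
        }
    }
}
```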