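A search routine that uses full-text search (which means the matching is hard to reproduce outside that routine) returns rows with the matched strings highlighted inline, for example: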
"i have been <em>match</em>ed"
"a <em>match</em> will happen in the word <em>match</em>"
"some random words including the word <em>match</em> here"
Now I need to take the first x characters of each string, but the HTML tags inside it are causing me trouble.
For example:
"i have been <em>mat</em>..." -> first 15 characters
"a <em>match</em> will happen in the word <em>m</em>..." -> first 33 characters
"some rando..." -> first 10 characters
I have tried a few other approaches, but I always end up with a big plate of spaghetti code.
Any hints?
Answer 0 (score: 1)
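I would suggest writing a simple parser with a few states: InText, InOpeningTag, and InClosingTag are the ones that come to mind.
Just iterate over the characters, work out whether you are InText, and count only those characters. Once you reach the limit, stop adding text, and if you are between an opening and a closing tag, just append the closing tag.
If you are not sure what I mean, look at the source code of the HTML Agility Pack (look for the Parse method).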
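The states described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the HTML Agility Pack's implementation: HtmlTruncator, Truncate, and the ellipsis parameter are names made up for illustration, and the input is assumed to contain only simple, non-nested tags without attributes, such as <em>. An opening tag is held back until its first text character is written, so a cut right at a tag boundary does not leave an empty element behind.

```csharp
using System.Text;

public static class HtmlTruncator
{
    private enum State { InText, InOpeningTag, InClosingTag }

    // Hypothetical helper: keeps the first maxVisible text characters of html
    // and closes any tag that is still open at the cut.
    // Assumes simple, non-nested, attribute-free tags such as <em>...</em>.
    public static string Truncate(string html, int maxVisible, string ellipsis = "...")
    {
        var output = new StringBuilder();
        var tagName = new StringBuilder(); // name of the tag currently being read
        string pendingOpen = "";           // opening tag read but not yet written
        string openName = "";              // tag written to output and not yet closed
        var state = State.InText;
        int visible = 0;                   // text characters written so far
        bool truncated = false;

        for (int i = 0; i < html.Length && !truncated; i++)
        {
            char c = html[i];
            if (state == State.InText)
            {
                if (c == '<')
                {
                    state = State.InOpeningTag;
                    tagName.Clear();
                }
                else if (visible == maxVisible)
                {
                    truncated = true; // more text follows, but the budget is spent
                }
                else
                {
                    if (pendingOpen.Length > 0)
                    {
                        // Write the opening tag only once it has content.
                        output.Append('<').Append(pendingOpen).Append('>');
                        openName = pendingOpen;
                        pendingOpen = "";
                    }
                    output.Append(c);
                    visible++;
                }
            }
            else if (c == '/' && tagName.Length == 0)
            {
                state = State.InClosingTag; // "</" marks a closing tag
            }
            else if (c != '>')
            {
                tagName.Append(c);
            }
            else
            {
                if (state == State.InOpeningTag)
                {
                    pendingOpen = tagName.ToString();
                }
                else
                {
                    if (openName.Length > 0) output.Append("</").Append(openName).Append('>');
                    openName = "";
                    pendingOpen = ""; // drop an element that never received text
                }
                state = State.InText;
            }
        }

        // Cut inside an element: append the closing tag it still needs.
        if (openName.Length > 0) output.Append("</").Append(openName).Append('>');
        if (truncated) output.Append(ellipsis);
        return output.ToString();
    }
}
```

With the first sample row from the question, HtmlTruncator.Truncate("i have been <em>match</em>ed", 15) returns "i have been <em>mat</em>...". Tag characters never count toward the limit, which is why no index arithmetic against the original string is needed.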
Answer 1 (score: 1)
This should do what you want, based on the <em> tags.
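using System;
using System.Collections.Generic;
using System.Text;

namespace Test
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var dbResults = GetMatches();
            var firstLine = HtmlSubstring(dbResults[0], 0, 15);
            Console.WriteLine(firstLine);
            var secondLine = HtmlSubstring(dbResults[1], 0, 33);
            Console.WriteLine(secondLine);
            var thirdLine = HtmlSubstring(dbResults[2], 0, 10);
            Console.WriteLine(thirdLine);
            Console.Read();
        }

        private static List<string> GetMatches()
        {
            return new List<string>
            {
                "i have been <em>match</em>ed",
                "a <em>match</em> will happen in the word <em>match</em>",
                "some random words including the word <em>match</em> here"
            };
        }

        // Returns up to length visible characters starting at visible offset start,
        // re-inserting <em> and </em> around the parts of each match inside that window.
        private static string HtmlSubstring(string mainString, int start, int length = int.MaxValue)
        {
            const string Open = "<em>", Close = "</em>";
            string plain = mainString.Replace(Close, "").Replace(Open, "");
            // Clamp both arguments so Substring cannot throw (e.g. for the default length).
            start = Math.Max(0, Math.Min(start, plain.Length));
            length = Math.Max(0, Math.Min(length, plain.Length - start));
            var substringResult = new StringBuilder(plain.Substring(start, length));

            int removed = 0;  // tag characters skipped so far in mainString
            int inserted = 0; // tag characters added so far to substringResult
            // Compare with >= 0 so a match at the very start of the string is not missed.
            int matchIndex = mainString.IndexOf(Open, StringComparison.Ordinal);
            while (matchIndex >= 0)
            {
                int matchEndIndex = mainString.IndexOf(Close, matchIndex, StringComparison.Ordinal);
                if (matchEndIndex < 0) break; // unbalanced tag: leave the rest unhighlighted

                // Match boundaries in tag-free coordinates, clipped to the window.
                int from = Math.Max(matchIndex - removed, start) - start;
                int to = Math.Min(matchEndIndex - removed - Open.Length, start + length) - start;
                removed += Open.Length + Close.Length;
                if (from < to)
                {
                    substringResult.Insert(from + inserted, Open);
                    substringResult.Insert(to + inserted + Open.Length, Close);
                    inserted += Open.Length + Close.Length;
                }
                matchIndex = mainString.IndexOf(Open, matchEndIndex, StringComparison.Ordinal);
            }
            return substringResult.ToString();
        }
    }
}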