Suppose I have the following string:
Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall.
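This string is a sequence of characters with no spaces between them, and it also has an HTML image tag embedded in it. I now want to split the string into words of at most 10 characters each, so the output should be: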
1)Hellotoevr
2)yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog
3)ladtoseeal
4)l.
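So the idea is to keep any HTML tag intact and count it as zero characters.
I have written a method like this, but it does not take HTML tags into account: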
public static string EnsureWordLength(this string target, int length)
{
    string[] words = target.Split(' ');
    for (int i = 0; i < words.Length; i++)
        if (words[i].Length > length)
        {
            var possible = true;
            var ord = 1;
            do
            {
                var lengthTmp = length*ord+ord-1;
                if (lengthTmp < words[i].Length) words[i] = words[i].Insert(lengthTmp, " ");
                else possible = false;
                ord++;
            } while (possible);
        }
    return string.Join(" ", words);
}
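I would like to see code that performs the split I have described. Thanks.
Answer 0 (score: 3)
Here is a regex solution that meets your requirements. Keep in mind that it may stop working if you change your requirements even slightly, which stays true to the well-known quote about regular expressions.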
using System;
using System.Text.RegularExpressions;

string[] samples = {
    @"Hellotoevryone<img height=""115"" width=""150"" alt="""" src=""/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg"" />Iamsogladtoseeall.",
    "Testing123Hello.World",
    @"Test<a href=""http://stackoverflow.com"">StackOverflow</a>",
    @"Blah<a href=""http://stackoverflow.com"">StackOverflow</a>Blah<a href=""http://serverfault.com"">ServerFault</a>",
    @"Test<a href=""http://serverfault.com"">Server Fault</a>", // has a space, not matched
    "Stack Overflow" // has a space, not matched
};

// use these 2 lines if you don't want to use regex comments
//string pattern = @"^((?:\S(?:\<[^>]+\>)?){1,10})+$";
//Regex rx = new Regex(pattern);

// regex comments spanning multiple lines require RegexOptions.IgnorePatternWhitespace
string pattern = @"^(       # match line/string start, begin group
    (?:\S                   # match (but don't capture) a non-whitespace char
        (?:\<[^>]+\>)?      # optionally match (without capturing) an html <...> tag
                            # to match img tags only change to (?:\<img[^>]+\>)?
    ){1,10}                 # match up to 10 chars (tags don't count per your example)
)+$                         # match at least once, and match end of line/string
";
Regex rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

foreach (string sample in samples)
{
    if (rx.IsMatch(sample))
    {
        foreach (Match m in rx.Matches(sample))
        {
            // using group index 1, group 0 is the entire match which I'm not interested in
            foreach (Capture c in m.Groups[1].Captures)
            {
                Console.WriteLine("Capture: {0} -- ({1})", c.Value, c.Value.Length);
            }
        }
    }
    else
    {
        Console.WriteLine("Not a match: {0}", sample);
    }
    Console.WriteLine();
}
With the samples above, this is the output (the number in parentheses is the string length):
Capture: Hellotoevr -- (10)
Capture: yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog -- (116)
Capture: ladtoseeal -- (10)
Capture: l. -- (2)
Capture: Testing123 -- (10)
Capture: Hello.Worl -- (10)
Capture: d -- (1)
Capture: Test<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a> -- (11)
Capture: Blah<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a>Bla -- (14)
Capture: h<a href="http://serverfault.com">ServerFau -- (43)
Capture: lt</a> -- (6)
Not a match: Test<a href="http://serverfault.com">Server Fault</a>
Not a match: Stack Overflow
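The last two samples are rejected because the pattern must match the whole string and `\S` never matches a space. If such inputs should still be chunked, one possible alternative is to tokenize the string into "whole tag" or "single character" and count only the characters. The following is a minimal sketch of that idea, not part of the answer above: the names `TagAwareChunker` and `Split` are hypothetical, and it assumes that spaces count as ordinary characters and that a `<` without a closing `>` is plain text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TagAwareChunker
{
    // A token is either a complete <...> tag or exactly one character.
    // Singleline lets '.' match newlines as well.
    private static readonly Regex Token =
        new Regex(@"<[^>]+>|.", RegexOptions.Singleline);

    // Hypothetical helper: splits input into chunks holding at most
    // `length` visible characters; tags are kept whole and not counted.
    public static List<string> Split(string input, int length)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (length < 1) throw new ArgumentOutOfRangeException(nameof(length));

        var chunks = new List<string>();
        var current = new StringBuilder();
        int visible = 0; // characters in the current chunk, tags excluded

        foreach (Match m in Token.Matches(input))
        {
            // Only the tag alternative can produce a token longer than 1.
            bool isTag = m.Length > 1 && m.Value[0] == '<';
            if (!isTag)
            {
                // Start a new chunk only when another visible character
                // arrives, so a tag stays attached to the text before it.
                if (visible == length)
                {
                    chunks.Add(current.ToString());
                    current.Clear();
                    visible = 0;
                }
                visible++;
            }
            current.Append(m.Value);
        }

        if (current.Length > 0) chunks.Add(current.ToString());
        return chunks;
    }
}
```

Under these assumptions, `TagAwareChunker.Split("Testing123Hello.World", 10)` yields the same three pieces as the regex above, and `"Stack Overflow"` comes back as `"Stack Over"` and `"flow"` instead of being rejected.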
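Answer 1 (score: 1)
The following code handles the cases you provided, but it will break on anything more complex. Also, since you did not specify how to handle long-form tags that contain inner text or HTML, it treats every tag as a short-form tag (run the code to see what I mean).
It works with this input: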
Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall. Hellotoevryone<img src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsoglad<img src="baz.jpeg" />toseeall. Hello<span class="foo">toevryone</span>Iamso<em>glad</em>toseeallTheQuickBrown<img src="bar.jpeg" />FoxJumpsOverTheLazyDog. Hello<span class="foo">toevryone</span>Iamso<em>glad</em>toseeall. Loremipsumdolorsitamet,consecteturadipiscingelit.Nullamacnibhelit,quisvolutpatnunc.Donecultrices,ipsumquisaccumsanconvallis,tortortortorgravidaante,etsollicitudinipsumnequeeulorem.
It breaks on this input (note the incomplete tag):
Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" /Iamsogladtoseeall.
using System;
using System.Text.RegularExpressions;
using System.IO;
using System.Collections.Generic;

public static class CustomSplit {
    // Expects the path of a text file as the first argument; each line is processed separately.
    public static void Main(String[] args) {
        if (args.Length > 0 && File.Exists(args[0])) {
            String[] lines;
            using (StreamReader sr = new StreamReader(args[0])) {
                lines = sr.ReadToEnd().Split(new String[]{Environment.NewLine}, StringSplitOptions.None);
            }
            int counter = 0;
            foreach (String line in lines) {
                Console.WriteLine("########### Line {0} ###########", ++counter);
                Console.WriteLine(line);
                Console.WriteLine(line.EnsureWordLength(10));
            }
        }
    }
}
public static class EnsureWordLengthExtension {
    public static String EnsureWordLength(this String target, int length) {
        List<List<Char>> words = new List<List<Char>>();
        words.Add(new List<Char>());
        for (int i = 0; i < target.Length; i++) {
            words[words.Count - 1].Add(target[i]);
            if (target[i] == '<') {
                // Copy the whole tag into the current word.
                // An unterminated tag runs past the end of the string here
                // (IndexOutOfRangeException); this is the breaking case above.
                do {
                    i++;
                    words[words.Count - 1].Add(target[i]);
                } while(target[i] != '>');
            }
            // Start a new word once the current one has `length` visible characters.
            if ((new String(words[words.Count - 1].ToArray())).CountCharsWithoutTags() == length) {
                words.Add(new List<Char>());
            }
        }
        String[] result = new String[words.Count];
        for (int j = 0; j < words.Count; j++) {
            result[j] = new String(words[j].ToArray());
        }
        return String.Join(" ", result);
    }

    // Length of the string after removing every <...> tag.
    private static int CountCharsWithoutTags(this String target) {
        return Regex.Replace(target, "<.*?>", "").Length;
    }
}
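For comparison, here is a rough single-pass sketch that avoids the out-of-range failure on an incomplete tag and does not re-run a regex for every character. It is not the code from this answer: the names `SafeWordLength` and `EnsureWordLengthSafe` are hypothetical, and it assumes that a `<` with no closing `>` should be treated as ordinary text and that existing whitespace outside tags ends the current word.

```csharp
using System;
using System.Text;

public static class SafeWordLength
{
    // Hypothetical variant of EnsureWordLength: inserts a space before a
    // visible character that would exceed `length` within one word.
    // Complete <...> tags are copied through and never counted.
    public static string EnsureWordLengthSafe(this string target, int length)
    {
        if (target == null) throw new ArgumentNullException(nameof(target));
        if (length < 1) throw new ArgumentOutOfRangeException(nameof(length));

        var result = new StringBuilder(target.Length + target.Length / length);
        int visible = 0; // visible characters in the current word

        for (int i = 0; i < target.Length; i++)
        {
            char c = target[i];

            if (c == '<')
            {
                int close = target.IndexOf('>', i + 1);
                if (close >= 0)
                {
                    // Copy the whole tag without counting it.
                    result.Append(target, i, close - i + 1);
                    i = close;
                    continue;
                }
                // No closing '>': fall through and treat '<' as text.
            }

            if (char.IsWhiteSpace(c))
            {
                // Whitespace already separates words, so reset the count.
                visible = 0;
                result.Append(c);
                continue;
            }

            if (visible == length)
            {
                result.Append(' ');
                visible = 0;
            }
            result.Append(c);
            visible++;
        }

        return result.ToString();
    }
}
```

With these assumptions, `"Testing123Hello.World".EnsureWordLengthSafe(10)` gives `"Testing123 Hello.Worl d"`, and the incomplete-tag input above no longer throws; its unterminated `<img ...` is simply split like any other text. Unlike the original method, this sketch does not append a trailing space when the text ends exactly on a word boundary.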