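This one is challenging. Our application lets users post news on the home page. The news is entered through a rich text editor that allows HTML. On the home page we only want to display a truncated summary of each news item.
For example, here is the full text we are displaying, including HTML:
In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.
We want to trim the news item to 250 characters, not counting the HTML.
The method we currently use for trimming counts the HTML as well, so HTML-heavy news posts get truncated far too early.
For example, if the sample above contained a lot of HTML, the summary could end up looking like this:
In an attempt to make a bit more space in the office, kitchen, I've pulled...
This is not what we want.
Does anyone have a way to tokenize the HTML tags so that their positions in the string are preserved, run a length check and/or trim on the string, and then restore the HTML to its old positions in the string?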
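Answer 0 (score: 10)
Start at the first character of the post and step over each character in turn. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter reaches 250 is where you actually want to cut.
Note that this leaves another problem you will have to deal with: an HTML tag that is opened before the cutoff but not closed before it.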
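A minimal Java sketch of the counting approach described above, not a definitive implementation. The class name VisibleCharCounter and the method cutIndex are hypothetical names chosen for illustration. The sketch assumes that every '<' starts a tag and that entities such as &amp; count as several characters.

```java
// Hypothetical helper: finds where to cut so that only visible text is counted.
final class VisibleCharCounter {

    /**
     * Returns the index just past the character at which the visible-text
     * counter reaches limit, or html.length() if the text is shorter.
     */
    static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false; // true between '<' and the next '>'
        int count = 0;         // visible characters seen so far
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;          // stop counting
            } else if (c == '>' && inTag) {
                inTag = false;         // resume counting after the tag
            } else if (!inTag) {
                count++;
                if (count == limit) {
                    return i + 1;      // cut right after this character
                }
            }
        }
        return html.length();
    }
}
```

For the input <p>Hello <b>world</b></p> and a limit of 8, cutIndex returns 14, so html.substring(0, 14) gives <p>Hello <b>wo. Both tags are left open in that result, which is the unclosed-tag problem mentioned above.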
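Answer 1 (score: 2)
Following the 2-state finite state machine suggestion, I have just written a simple HTML parser for this purpose in Java:
Here is a test case:
And here is the Java code: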
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    // tags that never get a closing tag, so they are not tracked as open
    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    // currently open tags, in the order they were opened
    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){
        // reset
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;
        String tag = "";

        // a negative cut point means "do not cut"
        if (cutPoint < 0){
            return htmlString;
        }
        if (null != htmlString){
            if (cutPoint == 0){
                return "";
            }
            for (int i = 0; i < htmlString.length(); i++){
                String strC = htmlString.substring(i, i+1);
                if (strC.equals("<")){
                    // new tag or tag closure
                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";
                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){
                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);
                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){
                                // closure: close the most recently opened tag;
                                // a stray closing tag with nothing open is dropped
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }
                            } else {
                                // new tag
                                sb.append(tagSb.toString());
                                if (!isToSkip(tag)){
                                    tags.add(tag);
                                }
                            }
                            // continue scanning after the end of this tag
                            i = k;
                            break;
                        }
                    }
                } else {
                    // visible character: copy it and count it
                    sb.append(strC);
                    charCount++;
                }
                // cut check
                if (charCount >= cutPoint){
                    // close previously open tags, innermost first
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                }
            }
            return sb.toString();
        } else {
            return null;
        }
    }

    private boolean isToSkip(String tag) {
        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }
        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }
        return false;
    }

    private String getTag(String tagString) {
        // strip the angle brackets
        String name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        if (name.contains(" ")){
            // tag with attributes: keep only the name
            name = name.substring(0, name.indexOf(" "));
        }
        if (name.endsWith("/")){
            // self-closing form such as <br/>: drop the trailing slash
            name = name.substring(0, name.length() - 1);
        }
        return name;
    }
}
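Answer 2 (score: 0)
If I understand the problem correctly, you want to keep the HTML formatting, but you do not want it counted as part of the length of the string you are keeping.
You can accomplish this with code that implements a simple finite state machine.
2 states: InTag, OutOfTag
InTag:
- goes to OutOfTag if a > character is encountered
- stays in InTag if any other character is encountered
OutOfTag:
- goes to InTag if a < character is encountered
- stays in OutOfTag if any other character is encountered
Your starting state is OutOfTag.
You implement the finite state machine by processing 1 character at a time. Processing each character takes you to a new state.
As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).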
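The state machine above could be sketched in Java roughly as follows. This is an illustration under assumptions rather than a definitive implementation: the names TagAwareTruncator, truncate and maxVisible are hypothetical, only characters read in the OutOfTag state are counted, and tags left open at the cut are not closed.

```java
// Hypothetical sketch of the two-state machine with an output buffer and a length counter.
final class TagAwareTruncator {

    private enum State { IN_TAG, OUT_OF_TAG }

    /** Copies html until maxVisible characters outside tags have been emitted. */
    static String truncate(String html, int maxVisible) {
        StringBuilder out = new StringBuilder(); // output buffer
        int visible = 0;                         // length encountered so far
        State state = State.OUT_OF_TAG;          // starting state

        for (int i = 0; i < html.length() && visible < maxVisible; i++) {
            char c = html.charAt(i);
            switch (state) {
                case OUT_OF_TAG:
                    if (c == '<') {
                        state = State.IN_TAG;     // tag starts, stop counting
                    } else {
                        visible++;                // ordinary text character
                    }
                    break;
                case IN_TAG:
                    if (c == '>') {
                        state = State.OUT_OF_TAG; // tag ends, resume counting
                    }
                    break;
            }
            out.append(c); // every character is copied, counted or not
        }
        return out.toString();
    }
}
```

With this sketch, truncate("<p>Hello <b>world</b></p>", 8) returns <p>Hello <b>wo. Closing the tags that are still open at that point would need an extra step, for example a stack of open tag names.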
Answer 3 (score: 0)
Here is the implementation I came up with, in C#:
public static string TrimToLength(string input, int length)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;

    if (input.Length <= length)
        return input;

    bool inTag = false;
    int targetLength = 0;

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];

        if (c == '>')
        {
            inTag = false;
            continue;
        }

        if (c == '<')
        {
            inTag = true;
            continue;
        }

        if (inTag || char.IsWhiteSpace(c))
        {
            continue;
        }

        targetLength++;

        if (targetLength == length)
        {
            return ConvertToXhtml(input.Substring(0, i + 1));
        }
    }

    return input;
}
And here are a few unit tests I used while developing it via TDD:
[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
    Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
    Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
    string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I";

    Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
    string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                    "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
                    "</div>";

    string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                      "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                      "<br/>" +
                      "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

    Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
    string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                           "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                           "<br/>" +
                           "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                           "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

    string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                      "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                      "<br/>" +
                      "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

    Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}
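Answer 4 (score: 0)
I know this comes quite a while after the question was posted, but I had a similar issue and this is how I ended up solving it. My concern is the speed of regular expressions compared with iterating through an array.
Also, this does not handle the case where there is a space before an HTML tag and another one after it.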
private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");
    Regex singleUseTag = new Regex("^<[^>]*?/>");
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR
        //the sum of the remaining characters and those analyzed is less than the maxlength
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/') //Closing tag
                stack.Pop();
            else //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    //Close every tag that is still open
    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}
Answer 5 (score: 0)
Answer 6 (score: -1)
Wouldn't the fastest way be to use jQuery's text() method?
For example:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
var text = $('ul').text();
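would store the value OneTwoThree in the text variable (plus any whitespace that sits between the elements in the markup). That lets you get the actual length of the text without the HTML included.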
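For comparison, a rough server-side analogue of the same idea can be sketched in Java. This is not jQuery's text() and not a real HTML parser: it simply removes anything that looks like a tag with a regular expression, and the names PlainTextLength, stripTags and visibleLength are hypothetical. It assumes simple markup with no '>' inside attribute values, and it does not decode entities.

```java
// Hypothetical sketch: measure text length by stripping tag-like substrings.
final class PlainTextLength {

    /** Removes every substring of the form <...> and returns what is left. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** Length of the text once the tag-like substrings are removed. */
    static int visibleLength(String html) {
        return stripTags(html).length();
    }
}
```

For the markup <ul><li>One</li><li>Two</li><li>Three</li></ul> written on a single line, stripTags returns OneTwoThree and visibleLength returns 11. This only measures the text; it does not say where to cut the original HTML.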