OK, so I'm working on a web scraper that converts web pages into paragraphs of text. To strip the tags themselves, I found this on Stack Overflow:
- (NSString *) stripTags:(NSString *)str
{
    NSMutableString *ms = [NSMutableString stringWithCapacity:[str length]];
    NSScanner *scanner = [NSScanner scannerWithString:str];
    [scanner setCharactersToBeSkipped:nil];
    NSString *s = nil;
    while (![scanner isAtEnd])
    {
        [scanner scanUpToString:@"<" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        s = nil;
    }
    return ms;
}
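It works, but it only removes the tags themselves, not the content between script and style tags. (Obviously I don't want to remove the content between all tags, since that would leave an empty string.)
Is there any way to have the contents of script and style tags stripped as well?
Thanks very much in advance.
Edit:
I tried changing the code to: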
- (NSString *) stripTags:(NSString *)str
{
    NSMutableString *ms = [NSMutableString stringWithCapacity:[str length]];
    NSScanner *scanner = [NSScanner scannerWithString:str];
    [scanner setCharactersToBeSkipped:nil];
    NSString *s = nil;
    while (![scanner isAtEnd])
    {
        [scanner scanUpToString:@"<script" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        [scanner scanUpToString:@"script>" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        [scanner scanUpToString:@"<" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        s = nil;
    }
    return ms;
}
But the scripts and CSS are still included in the output.
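Answer 0 (score: 1)
You can edit the scanner code so that it inspects each tag. If the tag is one you want to remove, scan ahead to its closing tag and discard that string, i.e. don't store or append it.
Read the tag start (<), then read the tag name so you can check what it is. Then read up to the tag close and either drop or keep the result.
Start with something like this (typed inline, not tested in any way):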
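// s, ms and scanner are set up as in the question's stripTags: method.
// "tagToIgnore" is a placeholder for a real tag name such as script or style.
NSString *t = nil;
while (![scanner isAtEnd])
{
    // Copy the text that precedes the next tag.
    if ([scanner scanUpToString:@"<" intoString:&s])
        [ms appendString:s];
    // Read the tag itself; t starts with "<", e.g. "<tagToIgnore attr=...".
    t = nil;
    [scanner scanUpToString:@">" intoString:&t];
    if ([[t lowercaseString] hasPrefix:@"<tagtoignore"]) {
        // Skip everything up to the closing tag; the next pass through the
        // loop then consumes the closing tag like any other tag.
        [scanner scanUpToString:@"</tagToIgnore" intoString:NULL];
        continue;
    }
    if (![scanner isAtEnd])
        [scanner setScanLocation:[scanner scanLocation]+1];
}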
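For completeness, here is a minimal sketch of how the idea above might look as a whole function that handles both script and style in one pass. The name StripTagsAndBlocks and the blockTags list are made up for illustration and are not part of the original post. It assumes reasonably well-formed markup: it matches tag names by prefix, and it does not handle HTML comments or a literal ">" inside attribute values, so a real HTML parser is the safer choice for arbitrary pages.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not from the original post): removes
// all tags and also drops the bodies of <script> and <style> elements.
static NSString *StripTagsAndBlocks(NSString *html)
{
    NSArray *blockTags = @[@"script", @"style"];
    NSMutableString *result = [NSMutableString stringWithCapacity:[html length]];
    NSScanner *scanner = [NSScanner scannerWithString:html];
    [scanner setCharactersToBeSkipped:nil];  // keep whitespace in the text
    [scanner setCaseSensitive:NO];           // already the default; also matches </SCRIPT>

    while (![scanner isAtEnd]) {
        // Keep the text in front of the next tag, if there is any.
        NSString *text = nil;
        if ([scanner scanUpToString:@"<" intoString:&text]) {
            [result appendString:text];
        }

        // Read the tag up to (not including) ">", then step past the ">".
        NSString *tag = nil;
        [scanner scanUpToString:@">" intoString:&tag];
        if (![scanner isAtEnd]) {
            [scanner setScanLocation:[scanner scanLocation] + 1];
        }

        // For an opening <script ...> or <style ...>, jump to its closing
        // tag without appending anything. The closing tag itself is
        // consumed by the next loop iteration like any other tag.
        NSString *lower = [tag lowercaseString];
        for (NSString *name in blockTags) {
            if ([lower hasPrefix:[@"<" stringByAppendingString:name]]) {
                [scanner scanUpToString:[@"</" stringByAppendingString:name]
                             intoString:NULL];
                break;
            }
        }
    }
    return result;
}
```

Calling StripTagsAndBlocks(@"<p>Hello</p><script>var a = 1 < 2;</script><b>World</b>") should give HelloWorld. Note that scanLocation is deprecated in recent SDKs; it is used here only to stay close to the code in the question.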