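I stored a website's source code in a string and successfully stripped all of the HTML tags. However, random gaps are left between the paragraphs. Sometimes it is a single blank line, other times four or five.
Here is what I did: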
    - (NSString *)parseHTMLText:(NSString *)text {
        NSString *startingPt = @"<!-- (START) Pagination Content Wrapper -->";
        NSString *endingPt = @"<!-- (END) Pagination Content Wrapper -->";
        // isolate body text from entire source code
        NSString *leftTrimmed = [text substringFromIndex:NSMaxRange([text rangeOfString:startingPt])];
        NSString *completeTrimmed = [leftTrimmed substringToIndex:[leftTrimmed rangeOfString:endingPt].location];
        completeTrimmed = [completeTrimmed stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        completeTrimmed = [self removeHTMlTagsFromString:completeTrimmed];
        completeTrimmed = [completeTrimmed stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        return completeTrimmed;
    }

    - (NSString *)removeHTMlTagsFromString:(NSString *)text {
        // check if there are any html tags
        if ([text rangeOfString:@"<"].location != NSNotFound && [text rangeOfString:@">"].location != NSNotFound) {
            // find first index of "<" (NSRange fields are NSUInteger, not int)
            NSUInteger startIndex = [text rangeOfString:@"<"].location;
            NSString *startOfTag = [text substringFromIndex:startIndex];
            // find length up to and including ">"
            NSUInteger length = [startOfTag rangeOfString:@">"].location + 1;
            text = [text stringByReplacingCharactersInRange:NSMakeRange(startIndex, length) withString:@""];
            text = [self removeHTMlTagsFromString:text];
        }
        return text;
    }
I tried this, but it does not work:
    completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@" " withString:@""];
Answer 0 (score: 0)
If the original HTML looks like this:
    <tag1>
        <tag2>
            <tag3>
                text text text
            </tag3>
        </tag2>
    </tag1>
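then after you remove all of the tags, the line breaks that separated them are still there. Each tag that sat on its own line leaves an empty line behind, which is why the gap size varies with how deeply the markup was nested.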
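If you want to keep the string-based approach, one option is to collapse the leftover line breaks after stripping the tags. The attempt in the question removes only space characters, so the newlines survive. Below is a minimal sketch, assuming a single line break between paragraphs is what you want; `CollapseBlankLines` is a hypothetical helper name, not something from the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces every run of line breaks (together with any
// spaces or tabs around them) with a single "\n".
static NSString *CollapseBlankLines(NSString *text) {
    NSError *error = nil;
    // Optional trailing spaces/tabs, then one or more line breaks, each
    // followed by optional leading spaces/tabs.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"[ \\t]*(?:\\r?\\n[ \\t]*)+"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return text; // pattern failed to compile; leave the input untouched
    }
    return [regex stringByReplacingMatchesInString:text
                                           options:0
                                             range:NSMakeRange(0, text.length)
                                      withTemplate:@"\n"];
}
```

You would call it on the result of `removeHTMlTagsFromString:` before the final trim. If you would rather keep one empty line between paragraphs, pass `@"\n\n"` as the template instead.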
Use a DOM parsing library instead of raw string functions and your problem should go away.
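As a rough illustration of letting a real HTML parser do the work, the sketch below uses the HTML importer built into `NSAttributedString` (UIKit on iOS 7 and later, AppKit on macOS). This is a stand-in, not a full DOM library: it gives you the rendered plain text rather than a node tree. `PlainTextFromHTML` is a hypothetical helper name, and the input is assumed to be UTF-8 encodable.

```objc
#import <UIKit/UIKit.h> // on macOS, import <AppKit/AppKit.h> instead

// Hypothetical helper: parses an HTML fragment and returns its text content.
// The HTML importer should be called on the main thread.
static NSString *PlainTextFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }
    NSDictionary *options = @{
        NSDocumentTypeDocumentAttribute : NSHTMLTextDocumentType,
        NSCharacterEncodingDocumentAttribute : @(NSUTF8StringEncoding)
    };
    NSError *error = nil;
    NSAttributedString *attributed =
        [[NSAttributedString alloc] initWithData:data
                                         options:options
                                documentAttributes:nil
                                           error:&error];
    // attributed is nil if parsing failed; messaging nil returns nil.
    return attributed.string;
}
```

Because the parser applies HTML's own whitespace rules, line breaks that only separated tags in the source should not show up as extra blank lines in the result. If you need to walk the actual element tree, a dedicated parser such as libxml2's HTML parser is the closer match to what this answer suggests.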