Arbitrary blank lines between paragraphs when parsing HTML text from page source

Asked: 2012-10-25 06:23:02

Tags: objective-c html-parsing

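I store a website's source code in a string and have successfully removed all of the HTML tags. However, the paragraphs are separated by an unpredictable number of blank lines: sometimes just one, other times four or five.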

This is what I'm doing:

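- (NSString *)parseHTMLText:(NSString *)text {

    NSString *startingPt = @"<!-- (START) Pagination Content Wrapper -->";
    NSString *endingPt = @"<!-- (END) Pagination Content Wrapper -->";

    // Isolate the body text from the entire source code.
    // If either marker is missing there is nothing to extract, so return nil
    // instead of letting substringFromIndex:/substringToIndex: throw.
    NSRange startRange = [text rangeOfString:startingPt];
    if (startRange.location == NSNotFound) {
        return nil;
    }
    NSString *leftTrimmed = [text substringFromIndex:NSMaxRange(startRange)];
    NSRange endRange = [leftTrimmed rangeOfString:endingPt];
    if (endRange.location == NSNotFound) {
        return nil;
    }
    NSString *completeTrimmed = [leftTrimmed substringToIndex:endRange.location];
    completeTrimmed = [completeTrimmed stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    // Strip the tags, then trim leading and trailing whitespace again.
    completeTrimmed = [self removeHTMlTagsFromString:completeTrimmed];
    completeTrimmed = [completeTrimmed stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    return completeTrimmed;

}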

- (NSString *)removeHTMlTagsFromString:(NSString *)text {

    // Find the first "<"; if there is none, nothing is left to strip.
    NSRange openRange = [text rangeOfString:@"<"];
    if (openRange.location == NSNotFound) {
        return text;
    }

    // Find the first ">" after that "<"; an unmatched "<" is left as-is.
    NSRange searchRange = NSMakeRange(openRange.location, text.length - openRange.location);
    NSRange closeRange = [text rangeOfString:@">" options:0 range:searchRange];
    if (closeRange.location == NSNotFound) {
        return text;
    }

    // Remove the tag, then recurse to strip any remaining tags.
    NSRange tagRange = NSMakeRange(openRange.location, NSMaxRange(closeRange) - openRange.location);
    text = [text stringByReplacingCharactersInRange:tagRange withString:@""];
    return [self removeHTMlTagsFromString:text];

}

I tried this, but it doesn't work:

completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"     " withString:@""];

1 Answer:

Answer 0 (score: 0)

If the original HTML looks like this:

<tag1>
<tag2>
<tag3>
text text text
</tag3>
</tag2>
</tag1>

then after you remove all of the tags, you are still left with the line breaks that separated them.
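Those leftover characters are line breaks, not spaces, which is why replacing a literal run of spaces changes nothing. If you want to stay with string functions, one possible workaround is to collapse each run of two or more line breaks into a single blank line. The following is only a minimal sketch under that assumption; `CollapseBlankLines` is a hypothetical helper name, not something from the question.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): collapses every run of two
// or more line breaks, including lines that hold only spaces or tabs, into
// exactly one blank line. Single line breaks are left untouched.
static NSString *CollapseBlankLines(NSString *text) {
    return [text stringByReplacingOccurrencesOfString:@"[ \\t]*(\\r?\\n[ \\t]*){2,}"
                                           withString:@"\n\n"
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, text.length)];
}
```

It could be called on the result of `removeHTMlTagsFromString:` before the final trim, for example `completeTrimmed = CollapseBlankLines(completeTrimmed);`.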

Use a DOM parsing library instead of raw string functions, and your problem should be solved.
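As a rough illustration of the DOM approach, here is a sketch that uses libxml2's HTML parser and an XPath query. It assumes the paragraphs in the page are `<p>` elements (the question does not say) and that libxml2 is linked and its headers are on the search path. `ParagraphTextsFromHTML` is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper (not from the original post): returns the trimmed text of
// every <p> element in the HTML, one paragraph per array entry.
static NSArray *ParagraphTextsFromHTML(NSString *html) {
    NSMutableArray *paragraphs = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // Parse leniently; real-world pages are rarely well-formed.
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return paragraphs;
    }

    // Assumption: each paragraph of body text is a <p> element.
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = context ? xmlXPathEvalExpression((const xmlChar *)"//p", context) : NULL;
    xmlNodeSetPtr nodes = result ? result->nodesetval : NULL;

    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    for (int i = 0; nodes != NULL && i < nodes->nodeNr; i++) {
        // xmlNodeGetContent returns the concatenated text of the node's subtree.
        xmlChar *content = xmlNodeGetContent(nodes->nodeTab[i]);
        if (content == NULL) {
            continue;
        }
        NSString *text = [[NSString stringWithUTF8String:(const char *)content]
                             stringByTrimmingCharactersInSet:whitespace];
        xmlFree(content);
        if (text.length > 0) {
            [paragraphs addObject:text];
        }
    }

    if (result) xmlXPathFreeObject(result);
    if (context) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return paragraphs;
}
```

Joining the result with `[ParagraphTextsFromHTML(html) componentsJoinedByString:@"\n\n"]` gives exactly one blank line between paragraphs, regardless of how the markup was laid out in the source.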