HTML解析失败,带有重音字母(例如:é)

时间:2010-10-07 18:18:29

标签: iphone objective-c ios html-parsing

我正在使用这个库:http://benreeves.co.uk/objective-c-hmtl-parser/来解析我正在制作的一个小iPhone应用程序的HTML。到目前为止,我已经得到了代码,但是当它出现重音时失败了(到目前为止只有经验é)。这是我正在使用的代码:

NSError * error = nil;
HTMLParser * parser = [[HTMLParser alloc] initWithContentsOfURL:[NSURL URLWithString:@"http://intranet.westminster.org.uk/almanack/food.asp?nextweek=TRUE"] error:&error];

if (error) {
    NSLog(@"Error: %@", error);
    return nil;
}
HTMLNode * bodyNode = [parser body]; //Find the body tag
NSArray *individualMeals = [bodyNode findChildTags:@"font"];
for (HTMLNode *node in individualMeals) {
    if ([[node getAttributeNamed:@"color"] isEqual:@"green"]) {
        NSLog(@"%@",[node rawContents]);
    }
}

但它并没有解析所有文本。它在URL中找到重音后似乎放弃了。这是它在运行时产生的结果:

2010-10-07 18:40:59.296 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.298 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.305 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.307 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.308 Westminster[1011:207] <font color="green">Sausage &#13;<br/>Bacon&#13;<br/>Hash Brown&#13;<br/>Baked Beans&#13;<br/>Breakfast special&#13;<br/>Three cheese omelets&#13;<br/><br/><br/>Plain Porridge &#13;<br/><br/><br/><br/>Croissants &#13;<br/><br/> Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.309 Westminster[1011:207] <font color="green">Mulligatawny &#13;<br/>Black Olive &#13;<br/>RICE&#13;<br/>Roasted med veg in paella rice&#13;<br/>Hot and sticky wings on yellow rice&#13;<br/>Hoi Sin Pork Belly Steaks&#13;<br/>Vegetable Biriyani with a Mild Curry Sauce&#13;<br/>Babycorn Bamboo Shoots and Water Chestnuts &#13;<br/>Stir fried noodles with seaweed &#13;<br/>Lemon Sponge with Orange Sauce&#13;<br/>Vanilla Granola</font>
2010-10-07 18:40:59.310 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.312 Westminster[1011:207] <font color="green">Pea &amp; Ham &#13;<br/><br/>Black Olive &#13;<br/>Roast Chicken with Bread Sauce and Roast Jus&#13;<br/>Warm Salad of Salmon and Crispy Bacon&#13;<br/><br/><br/>Vegetarian Chilli&#13;<br/>With Sour Cream and Braised Rice&#13;<br/>Green Beans&#13;<br/><br/>Bubble &amp; Squeak&#13;<br/><br/>Tiramisu&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.313 Westminster[1011:207] <font color="green">Sausage&#13;<br/>Bacon&#13;<br/>Grilled Tomato&#13;<br/>Grilled mushrooms&#13;<br/>Fried Egg&#13;<br/><br/><br/><br/>Plain Porridge&#13;<br/><br/><br/><br/>Bread &#13;<br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.317 Westminster[1011:207] <font color="green">Root Vegetable&#13;<br/>Red Pesto &#13;<br/>WRAP&#13;<br/>Chimichanga&#13;&#13;<br/>Mexican fish tortillas&#13;&#13;<br/>Roast Leg of Lamb &#13;<br/>Gnocchi with Roasted Vegetables and Flaked Parmesan &#13;<br/>Broccoli &#13;<br/><br/><br/>Thyme Roasted Potatoes  &#13;<br/> Sticky Toffee Pudding and Toffee Sauce&#13;<br/>Banana Bread</font>
2010-10-07 18:40:59.318 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.318 Westminster[1011:207] <font color="green">Tomato with Basil Oil&#13;&#13;<br/>Red Pesto &#13;<br/>Beef Olives&#13;<br/><br/>Lamb with Ginger, Spring onion and Noodles&#13;<br/><br/><br/>Field Mushroom Pies&#13;&#13;<br/>Ratatouille &#13;<br/><br/>Creamed Potatoes&#13;<br/><br/>Lemon Tart&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.319 Westminster[1011:207] <font color="green">Sausage &#13;<br/>Bacon&#13;<br/>Baked Beans&#13;<br/>Grilled Tomato&#13;<br/>Breakfast special&#13;<br/>Avocado on toast&#13;<br/><br/>Plain Porridge &#13;<br/><br/><br/>Bread and banana bread&#13;<br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.333 Westminster[1011:207] <font color="green">(GREEK)&#13;<br/><br/>FLAT BREADS&#13;<br/>SPINACH, ROCKET AND FETA AND TOASTED SOUR DOUGHS &#13;<br/>SEAFOOD STUFFED PEPPERS&#13;<br/>STIFADO (beef)&#13;<br/><br/>LAMB FRICASSEE&#13;<br/>zucchini pie from Macedonia&#13;&#13;<br/>RICE&#13;<br/><br/>GIGANTIS PLAKI&#13;<br/><br/>ORANGE AND LEMON CAKE TOPPED WITH GREEK YOGURT AND HONEY</font>
2010-10-07 18:40:59.333 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.334 Westminster[1011:207] <font color="green">Roasted Vegetable&#13;<br/>FLAT BREADS&#13;<br/>Pork Steak Served with a Tomato, Tarragon and Mushroom sauce&#13;<br/>Roast beef and homemade horseradish sauce&#13;<br/><br/><br/>Lancashire Cheese Sausages with Onion Gravy&#13;&#13;<br/>Courgettes&#13;<br/><br/>Roast Potatoes&#13;<br/><br/>Mississippi Mud Pie&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.343 Westminster[1011:207] <font color="green">Sausage&#13;<br/>Bacon &#13;<br/>Hash Brown&#13;<br/>Grilled mushrooms&#13;<br/>Fried Egg&#13;<br/><br/><br/><br/>Plain Porridge &#13;<br/><br/><br/><br/>Bread &#13;<br/><br/> Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.344 Westminster[1011:207] <font color="green">Leek, Blue Cheese and Potato &#13;&#13;<br/>Sunflower Seed&#13;<br/>COUS COUS&#13;<br/>Couscous with apricots, lemon and coriander &#13;<br/><br/>Couscous fried chicken with couscous and spiced tomato sauce &#13;&#13;<br/>Butchers Sausages &#13;<br/>Balsamic Roasted Vegetable Frittata&#13;<br/>Red Cabbage&#13;<br/><br/><br/>Mashed Potatoes&#13;<br/><br/>Jam Roly Poly &#13;<br/>Bakewell Slice</font>
2010-10-07 18:40:59.344 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.345 Westminster[1011:207] <font color="green">Curried Parsnip and Apple &#13;&#13;<br/>Sunflower Seed&#13;<br/>Spiced Sticky chicken pieces&#13;<br/>Mexican Beef Chilli Wraps with Natural Yogurt and Guacamole&#13;<br/><br/><br/>Roasted Teriyaki Tofu Steaks with Glazed Green Vegetables&#13;&#13;<br/>Spiced Aubergine&#13;<br/><br/>Rice and Peas&#13;<br/><br/>Mango Mousse&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.351 Westminster[1011:207] <font color="green">Sausage&#13;<br/>Bacon&#13;<br/>Baked Beans&#13;<br/>Grilled Tomato&#13;<br/>Breakfast special&#13;<br/>Muffin bar&#13;<br/><br/>Plain Porridge&#13;<br/><br/><br/><br/>Croissants &#13;<br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.352 Westminster[1011:207] <font color="green">Carrot and Chilli &#13;&#13;<br/>Rosemary &#13;<br/>NOODLES&#13;<br/><br/>Crispy tofu&#13;<br/>Lemon chicken&#13;<br/>Fish with Traditional Crispy Batter&#13;<br/>Japanese Vegetable Curry with Rice Noodles and Tofu&#13;&#13;<br/>Garden peas&#13;<br/><br/><br/>Chips &#13;<br/>Viennese Jam Tart and Custard &#13;<br/>Fresh Fruit Salad</font>
2010-10-07 18:40:59.361 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.361 Westminster[1011:207] <font color="green">Three onion, spring, red and white&#13;<br/>Rosemary &#13;<br/>Pepperoni Pizza Topped with Boccaccio&#13;<br/>Bolognaise pasta bake&#13;<br/><br/>Vegetarian Plait&#13;&#13;<br/>Green Cabbage&#13;<br/><br/>Oven Baked Cajun Wedges&#13;<br/><br/>Ice&#13;<br/>Cream Sundae&#13;<br/><br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.362 Westminster[1011:207] <font color="green">Sausage &#13;<br/>Bacon &#13;<br/>Hash Brown&#13;<br/>Grilled Mushrooms&#13;<br/>Poached Eggs&#13;<br/><br/><br/><br/>Plain Porridge &#13;<br/><br/><br/><br/><br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.362 Westminster[1011:207] (null)
2010-10-07 18:40:59.363 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.363 Westminster[1011:207] <font color="green"/>

它在该部分放弃了炒土豆,并且不会从该部分或任何后面部分返回任何结果。

我认为这可能是因为该网站没有对és进行编码。当我查看来源时,我看到é而不是&amp; eacute; (不含空格,否则按其格式化......),如本网站所建议:http://www.w3.org/MarkUp/html3/latin1.html

感谢您的时间。如果你知道从网站上获取午餐的更好方法,我也很乐意听到它。

2 个答案:

答案 0 :(得分:0)

我有类似的问题。问题是libxml2只能解析UTF8编码的文件。所以你需要先将html页面转换为UTF8。

答案 1 :(得分:0)

我刚试过这个。我不会使用initWithContentsOfURL方法,而是使用connectionDidFinishLoading委托方法中的initWithData方法,例如

HTMLParser * parser = [[HTMLParser alloc] initWithData:receivedData error:&error];

似乎可以使用特殊字符和其他编码更好地工作。