Searching for links in an HTML string (Objective-C)

Time: 2018-11-05 00:15:21

Tags: objective-c xcode web-scraping html-parsing

This may be a bit confusing, since I'm fairly new to Objective-C. My app already fetches the page source:

NSURL *URL = [NSURL URLWithString:@"google.com"];
NSString *webData= [NSString stringWithContentsOfURL:URL encoding:NSASCIIStringEncoding error:nil];

That fetches the source correctly; I've logged and checked it. I just want to find the links in that string, so everything containing the keyword:

<a href

I tried searching the string like this:

 if ([webData containsString:@"<a href="]) {
    NSLog(@"string contains!");
} else {
    NSLog(@"string does not contain");
}

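It always comes back negative (the else branch runs), and I don't understand why. I just want to get the lines of code that contain links and put those lines into a new string. That string would contain all the links in the source, but I don't know how to do it. I hope I've provided enough information; if you have any questions about my problem, please ask. Thanks.

Edit 1: I tried the answer given, and this is my code: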

NSURL *URL = [NSURL URLWithString:@"google.com"];
NSString *webData= [NSString stringWithContentsOfURL:URL encoding:NSASCIIStringEncoding error:nil];
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\<a href=\"(.*)\".*<\/a\>"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];
NSUInteger numberOfMatches = [regex matchesInString:webData
                                                    options:0
                                                      range:NSMakeRange(0, [webData length])];

First of all, it doesn't work, and I get the following errors/warnings: [warnings]

Edit 2: I've tried to fix the code, and it is currently:

NSURL *URL = [NSURL URLWithString:@"google.com"];
NSString *webData= [NSString stringWithContentsOfURL:URL encoding:NSASCIIStringEncoding error:nil];
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\\<a.+?\\>.+?\\<\\/a\\>"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];
NSArray *matches = [regex matchesInString:webData
                          options:0
                          range:NSMakeRange(0, [webData length])];
NSLog(@"%@", matches);

This is the log being output:

2018-11-05 00:12:51.144009-0500 InjectionTest[42684:6739102] (
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2c00>{25654, 124}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2cc0>{38864, 316}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2340>{39939, 105}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2100>{40051, 103}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2000>{40203, 125}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2140>{41190, 91}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b0f00>{41297, 67}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}",
"<NSSimpleRegularExpressionCheckingResult: 0x6000037b2d80>{41479, 124}{<NSRegularExpression: 0x600002ca0210> \\<a.+?\\>.+?\\<\\/a\\> 0x1}"

I'm pretty sure that's not what I should be getting.

1 Answer:

Answer 0 (score: 0)

I'd suggest using NSRegularExpression instead.

With an appropriate pattern, for example:

\<a href=\"(.*)\".*<\/a\>

and the following method:

 matchesInString:options:range:

you will get a list of the A elements in the HTML string.

For more details, see Apple's official documentation:

https://developer.apple.com/documentation/foundation/nsregularexpression?language=objc
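
One thing that may explain the warnings in Edit 1 of the question: inside an Objective-C string literal, a sequence such as \< is not a recognised escape, so the compiler typically warns about it, and matchesInString:options:range: returns an NSArray rather than an NSUInteger. The sketch below is only an illustration under those assumptions. The helper name CountAnchorMatches is hypothetical, and the pattern is a non-greedy variant of the one above with the unneeded backslashes dropped, since <, > and / are ordinary characters in the pattern syntax NSRegularExpression uses.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answer): counts the
// non-overlapping <a href="...">...</a> matches in an HTML string.
static NSUInteger CountAnchorMatches(NSString *html) {
    if (html == nil) {
        return 0; // e.g. the download failed and returned nil
    }

    // Only the double quotes need a backslash here, and that is for the
    // compiler, not for the regex engine. (.*?) and .*? are non-greedy so
    // each match stops at the first closing quote and the first </a>.
    NSString *pattern = @"<a href=\"(.*?)\".*?</a>";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return 0;
    }

    // numberOfMatchesInString: returns a count; matchesInString: would
    // return an NSArray of NSTextCheckingResult objects instead.
    return [regex numberOfMatchesInString:html
                                  options:0
                                    range:NSMakeRange(0, html.length)];
}
```

Note that this pattern only matches anchors where href is the first attribute; the pattern used in the update below is looser and matches any <a ...>...</a> element.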

** Update **

Sample code that extracts all <a> elements from HTML text:

NSURL *url = [NSURL URLWithString:@"https://www.google.com"];
NSString *html = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:nil];
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\\<a.+?\\>.+?\\<\\/a\\>"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:nil];
NSArray *matches = [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
for (NSTextCheckingResult *match in matches) {
    NSRange matchRange = [match range];
    NSString *matchedString = [html substringWithRange:matchRange];
    NSLog(@"%@", matchedString);
}

Here is the log from the code above:

2018-11-05 14:30:01.252702 TestArray[1322:320814] <a class=gb1 href="https://www.google.co.jp/imghp?hl=ja&tab=wi">&#30011;&#20687;</a>
2018-11-05 14:30:01.252911 TestArray[1322:320814] <a class=gb1 href="https://maps.google.co.jp/maps?hl=ja&tab=wl">&#12510;&#12483;&#12503;</a>
2018-11-05 14:30:01.253120 TestArray[1322:320814] <a class=gb1 href="https://play.google.com/?hl=ja&tab=w8">Play</a>
2018-11-05 14:30:01.253346 TestArray[1322:320814] <a class=gb1 href="https://www.youtube.com/?gl=JP&tab=w1">YouTube</a>
2018-11-05 14:30:01.253512 TestArray[1322:320814] <a class=gb1 href="https://news.google.co.jp/nwshp?hl=ja&tab=wn">&#12491;&#12517;&#12540;&#12473;</a>
2018-11-05 14:30:01.253638 TestArray[1322:320814] <a class=gb1 href="https://mail.google.com/mail/?tab=wm">Gmail</a>
2018-11-05 14:30:01.253750 TestArray[1322:320814] <a class=gb1 href="https://drive.google.com/?tab=wo">&#12489;&#12521;&#12452;&#12502;</a>
2018-11-05 14:30:01.253934 TestArray[1322:320814] <a class=gb1 style="text-decoration:none" href="https://www.google.co.jp/intl/ja/options/"><u>&#12418;&#12387;&#12392;&#35211;&#12427;</u> &raquo;</a>
2018-11-05 14:30:01.254049 TestArray[1322:320814] <a href="http://www.google.co.jp/history/optout?hl=ja" class=gb4>&#12454;&#12455;&#12502;&#23653;&#27508;</a>
2018-11-05 14:30:01.254164 TestArray[1322:320814] <a  href="/preferences?hl=ja" class=gb4>&#35373;&#23450;</a>
2018-11-05 14:30:01.254274 TestArray[1322:320814] <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/" class=gb4>&#12525;&#12464;&#12452;&#12531;</a>
2018-11-05 14:30:01.254434 TestArray[1322:320814] <a href="/advanced_search?hl=ja&amp;authuser=0">&#26908;&#32034;&#12458;&#12503;&#12471;&#12519;&#12531;</a>
2018-11-05 14:30:01.254739 TestArray[1322:320814] <a href="/language_tools?hl=ja&amp;authuser=0">&#35328;&#35486;&#12484;&#12540;&#12523;</a>
2018-11-05 14:30:01.254900 TestArray[1322:320814] <a href="https://www.google.com/setprefs?sig=0_hs-qGLtJFycvdIdXbi2jQdSOY4s%3D&amp;hl=en&amp;source=homepage&amp;sa=X&amp;ved=0ahUKEwi_tb7pwrzeAhULwLwKHZ9aDiAQ2ZgBCAU">English</a>
2018-11-05 14:30:01.255072 TestArray[1322:320814] <a href="/intl/ja/ads/">&#24195;&#21578;&#25522;&#36617;</a>
2018-11-05 14:30:01.255182 TestArray[1322:320814] <a href="http://www.google.co.jp/intl/ja/services/">&#12499;&#12472;&#12493;&#12473; &#12477;&#12522;&#12517;&#12540;&#12471;&#12519;&#12531;</a>
2018-11-05 14:30:01.255453 TestArray[1322:320814] <a href="https://plus.google.com/115899767381375908215" rel="publisher">+Google</a>
2018-11-05 14:30:01.255609 TestArray[1322:320814] <a href="/intl/ja/about.html">Google &#12395;&#12388;&#12356;&#12390;</a>
2018-11-05 14:30:01.255722 TestArray[1322:320814] <a href="https://www.google.com/setprefdomain?prefdom=JP&amp;prev=https://www.google.co.jp/&amp;sig=K_erqW_iZ2bjJu2TsKii5UfNnAGcg%3D">Google.co.jp</a>
2018-11-05 14:30:01.255832 TestArray[1322:320814] <a href="/intl/ja/policies/privacy/">&#12503;&#12521;&#12452;&#12496;&#12471;&#12540;</a>
2018-11-05 14:30:01.256001 TestArray[1322:320814] <a href="/intl/ja/policies/terms/">&#35215;&#32004;</a>
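
If only the link targets are wanted rather than the whole element, a capture group can pick out the href value from each match. The following is a minimal sketch, not part of the original answer: the helper name ExtractHrefs and its pattern are assumptions made for illustration, and it only handles double-quoted href values like the ones in the log above. A regular expression is not a full HTML parser, so unusual markup may be missed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the double-quoted href value of each <a ...> tag.
static NSArray<NSString *> *ExtractHrefs(NSString *html) {
    NSMutableArray<NSString *> *links = [NSMutableArray array];
    if (html == nil) {
        return links; // nothing to search, e.g. the download failed
    }

    // "<a", one whitespace character, any other attributes, then href="...".
    // Capture group 1 is the text between the quotes.
    NSString *pattern = @"<a\\s[^>]*?href=\"([^\"]*)\"";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return links;
    }

    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        // rangeAtIndex:0 is the whole match; rangeAtIndex:1 is the first group.
        NSRange hrefRange = [match rangeAtIndex:1];
        if (hrefRange.location != NSNotFound) {
            [links addObject:[html substringWithRange:hrefRange]];
        }
    }
    return links;
}
```

The returned array can then be joined into a single string with componentsJoinedByString:@"\n" if one string containing all links is needed, as asked in the question.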