How can I extract all http links from the following code snippet using Java?

Asked: 2012-04-21 11:50:15

Tags: java regex json

I got the following result from a custom search.

{
    "kind": "customsearch#search",
    "url": {
        "type": "application/json",
        "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}& ={count?}&    start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&nsc={nsc?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
    },
    "queries": {
        "nextPage": [
            {
                "title": "Google Custom Search - flowers",
                "totalResults": 10300000,
                "searchTerms": "flowers",
                "count": 10,
                "startIndex": 11,
                "inputEncoding": "utf8",
                "outputEncoding": "utf8",
                "cx": "013036536707430787589:_pqjad5hr1a"
            }
        ],
        "request": [
            {
                "title": "Google Custom Search - flowers",
                "totalResults": 10300000,
                "searchTerms": "flowers",
                "count": 10,
                "startIndex": 1,
                "inputEncoding": "utf8",
                "outputEncoding": "utf8",
                "cx": "013036536707430787589:_pqjad5hr1a"
            }
        ]
    },
    "context": {
        "title": "Custom Search"
    },
    "items": [
        {
            "kind": "customsearch#result",
            "title": "Flower - Wikipedia, the free encyclopedia",
            "htmlTitle": "<b>Flower</b> - Wikipedia, the free encyclopedia",
            "link": "http://en.wikipedia.org/wiki/Flower",
            "displayLink": "en.wikipedia.org",
            "snippet": "A flower, sometimes known as a bloom or blossom, is the reproductive structure found in flowering plants (plants of the division Magnoliophyta, ...",
            "htmlSnippet": "A <b>flower</b>, sometimes known as a bloom or blossom, is the reproductive structure <br>  found in flowering plants (plants of the division Magnoliophyta, <b>... </b>",
            "pagemap": {
                "RTO": [
                    {
                        "format": "image",
                        "group_impression_tag": "prbx_kr_rto_term_enc",
                        "Opt::max_rank_top": "0",
                        "Opt::threshold_override": "3",
                        "Opt::disallow_same_domain": "1",
                        "Output::title": "<b>Flower</b>",
                        "Output::want_title_on_right": "true",
                        "Output::num_lines1": "3",
                        "Output::text1": "꽃은 식물 에서 씨 를 만들어 번식 기능을 수행하는 생식 기관 을 말한다. 꽃을 형태학적으로 관찰하여 최초로 총괄한 사람은 식물계를 24강으로 분류한 린네 였다. 그 후 꽃은 식물분류학상중요한 기준이 되었다.",
                        "Output::gray1b": "- 위키백과",
                        "Output::no_clip1b": "true",
                        "UrlOutput::url2": "http://en.wikipedia.org/wiki/Flower",
                        "Output::link2": "위키백과 (영문)",
                        "Output::text2b": "   ",
                        "UrlOutput::url2c": "http://ko.wikipedia.org/wiki/꽃",
                        "Output::link2c": "위키백과",
                        "result_group_header": "백과사전",
                        "Output::image_url": "http://www.gstatic.com/richsnippets/b/fcb6ee50e488743f.jpg",
                        "image_size": "80x80",
                        "Output::inline_image_width": "80",
                        "Output::inline_image_height": "80",
                        "Output::image_border": "1"
                    }
                ]
            }
        }
    ]
}

How can I use Java to extract all the https links from the code above?

4 answers:

Answer 0: (score: 2)

You could be lazy and skip parsing the JSON: treat the whole result as a string and use a regular expression to match the URLs.

String httpLinkPattern = "https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern p = Pattern.compile(httpLinkPattern);
Matcher m = p.matcher(jsonResult);
while (m.find())
  System.out.println("Found http link: "+m.group());
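For anyone who wants to try this as-is, here is a minimal self-contained sketch of the same idea. The class name `RegexLinkExtractor`, the method name `extractLinks` and the sample string are made up for illustration; the pattern is the one from this answer. It also de-duplicates matches, since the same link appears more than once in the response.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkExtractor {
    // Pattern from the answer: scheme, then ASCII URL characters, ending on a
    // character that is unlikely to be trailing punctuation.
    private static final Pattern HTTP_LINK = Pattern.compile(
            "https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    /** Returns each distinct match, in order of first appearance. */
    public static Set<String> extractLinks(String text) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HTTP_LINK.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample modelled on three values from the question;
        // U+AF43 is the Hangul syllable at the end of the url2c value.
        String sample = "{\"link\": \"http://en.wikipedia.org/wiki/Flower\", "
                + "\"UrlOutput::url2\": \"http://en.wikipedia.org/wiki/Flower\", "
                + "\"UrlOutput::url2c\": \"http://ko.wikipedia.org/wiki/\uAF43\"}";
        for (String link : extractLinks(sample)) {
            System.out.println("Found http link: " + link);
        }
    }
}
```

With this sample the duplicate Wikipedia link is reported once, and the Korean URL is cut off after `/wiki/` because `꽃` is outside the ASCII character class. For the same reason the `template` URL in the question would stop at the first `{`.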

Answer 1: (score: 0)

If you would rather treat the response as a string and extract the URLs from it instead of using a JSON library, the following should do it.

  public List<String> extractUrls(String input)
  {
    List<String> result = new ArrayList<String>();
    Pattern pattern =
        Pattern.compile("\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" + "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov"
            + "|mil|biz|info|mobi|name|aero|jobs|museum" + "|travel|[a-z]{2}))(:[\\d]{1,5})?"
            + "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?"
            + "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*"
            + "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");

    Matcher matcher = pattern.matcher(input);
    while (matcher.find())
    {
      result.add(matcher.group());
    }

    return result;
  }

Usage:

    List<String> links = extractUrls(jsonResponseString);
    for (String link : links)
    {
      System.out.println(link);
    }

Answer 2: (score: 0)

Please use a JSON parser for this; I think that is the best approach. See the following link for a good example:

Java code to parse JSON
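The linked example is not reproduced here. As a rough, dependency-free sketch in the same spirit (not a full JSON parser, and not necessarily what the link shows), the code below walks the text one JSON string literal at a time, decodes escapes, and keeps the values that start with `http://` or `https://`. The names `JsonStringLinkScanner` and `extractLinks` are made up, and it assumes the input is well-formed JSON. A real library (org.json, Gson, Jackson and so on) would let you pick specific fields such as `link` instead.

```java
import java.util.ArrayList;
import java.util.List;

class JsonStringLinkScanner {

    /** Collects every string literal whose decoded value starts with http:// or https://. */
    public static List<String> extractLinks(String json) {
        List<String> links = new ArrayList<>();
        int i = 0;
        while (i < json.length()) {
            if (json.charAt(i) != '"') {
                i++;
                continue;
            }
            StringBuilder value = new StringBuilder();
            i++; // skip the opening quote
            while (i < json.length() && json.charAt(i) != '"') {
                char c = json.charAt(i);
                if (c == '\\' && i + 1 < json.length()) {
                    char esc = json.charAt(i + 1);
                    if (esc == 'u' && i + 6 <= json.length()) {
                        // Four hex digits follow: decode them to one UTF-16 code unit.
                        value.append((char) Integer.parseInt(json.substring(i + 2, i + 6), 16));
                        i += 6;
                    } else {
                        value.append(unescape(esc));
                        i += 2;
                    }
                } else {
                    value.append(c);
                    i++;
                }
            }
            i++; // skip the closing quote
            String s = value.toString();
            if (s.startsWith("http://") || s.startsWith("https://")) {
                links.add(s);
            }
        }
        return links;
    }

    private static char unescape(char esc) {
        switch (esc) {
            case 'n': return '\n';
            case 't': return '\t';
            case 'r': return '\r';
            case 'b': return '\b';
            case 'f': return '\f';
            default:  return esc; // escaped quote, backslash and forward slash
        }
    }
}
```

Because it returns whole string values, a URL such as `http://ko.wikipedia.org/wiki/꽃` comes back intact, and the `template` value is returned complete with its `{...}` placeholders. It does not tell keys from values, so a key that happened to start with `http://` would be returned too.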

Answer 3: (score: 0)

https?://.*?\.(org|com|net|gov)/.*?(?=")

This regular expression should work for your purpose. http://regexr.com?30nm2
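To use this expression from Java it has to be escaped for a string literal: the backslash before the dot is doubled and the double quote in the lookahead gets a backslash. A minimal sketch, with a made-up class name `QuotedLinkExtractor` and made-up sample input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedLinkExtractor {
    // The regex from this answer, escaped for a Java string literal.
    private static final Pattern LINK =
            Pattern.compile("https?://.*?\\.(org|com|net|gov)/.*?(?=\")");

    /** Returns every match; each one ends just before the next double quote. */
    public static List<String> extractLinks(String json) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(json);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample with one value per line, as in the question.
        String sample = "\"link\": \"http://en.wikipedia.org/wiki/Flower\",\n"
                + "\"displayLink\": \"en.wikipedia.org\",\n"
                + "\"Output::image_url\": \"http://www.gstatic.com/richsnippets/b/fcb6ee50e488743f.jpg\"";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Two things to keep in mind. The pattern only matches URLs with one of the four listed top-level domains followed by a `/`. Also, because `.` matches everything except line terminators, a URL without that trailing slash (for example `http://example.com`) can make the match run on into a later URL on the same line. With pretty-printed JSON like the question's, where every value is on its own line, that is less likely to bite.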