Extracting images with jsoup when there is no img tag

Posted: 2018-07-23 18:05:05

Tags: java html jsoup

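I need to extract an image that sits inside a div, where the source is not in an img tag. I can't use getElementById(), because the id differs from page to page. In this case, can I use a regular expression to extract the image from the doc? Any help is appreciated.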

HTML snippet:

<div 
    class="rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center" 
    data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
    data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
    data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
    data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg" 
    style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);">
</div>

3 answers:

Answer 0 (score: 0)

This is far from an elegant or simple solution, but hopefully it gives you something to start from:

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Element;

    public class ImageAttributes {

      // captures the URL inside "background-image: url(...)", without the surrounding quotes
      private static final Pattern BACKGROUND_URL =
        Pattern.compile("background-image:\\s*url\\(\\s*['\"]?([^'\")]+)");

      public static void main(String[] args) {
        String snippet =
          "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center " +
            "mobile-center-center\" " +
            "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
            "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
            "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" " +
            "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" " +
            "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-" +
            "mobile/image.jpg&quot;);\"></div>";

        List<String> imgAttrs =
          Jsoup.parse(snippet)
            .getElementsByTag("div")
            .stream()
            // get the attributes of each div
            .map(Element::attributes)
            // flatten all attributes into a single stream
            .flatMap(attrs -> attrs.asList().stream())
            // keep only attributes whose value mentions a .jpg
            .filter(attribute -> attribute.getValue() != null && attribute.getValue().contains(".jpg"))
            // map to values
            .map(Attribute::getValue)
            // replace every ".transform" with a space
            .map(attrValue -> attrValue.replace(".transform", " "))
            // for the style attribute, keep only the url(...) value
            .map(ImageAttributes::getUrlFromBackgroundImage)
            // split each value on the spaces introduced above
            .flatMap(attrValue -> Stream.of(attrValue.split(" ")))
            .collect(Collectors.toList());

        imgAttrs.forEach(System.out::println);
      }

      // returns the URL from a background-image declaration, or the input unchanged if there is none
      private static String getUrlFromBackgroundImage(final String value) {
        Matcher matcher = BACKGROUND_URL.matcher(value);
        return matcher.find() ? matcher.group(1) : value;
      }
    }

With the snippet above, imgAttrs should contain:

/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-desktop/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg

Not sure whether that is what you need.
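
If only the URL inside the style attribute is needed, a plain java.util.regex pattern applied to the attribute value may be enough. The following is a minimal sketch using only the standard library; the class name BackgroundImageUrl and the method extract are hypothetical, and it assumes the style value has already been read from the element, so that &quot; has been decoded to a double quote.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class BackgroundImageUrl {

    // Matches background-image: url(...) with optional single or double quotes
    // and captures only the URL itself.
    private static final Pattern URL_IN_STYLE = Pattern.compile(
        "background-image\\s*:\\s*url\\(\\s*['\"]?([^'\")]+?)\\s*['\"]?\\s*\\)");

    // Returns the URL of the first background-image declaration, if any.
    static Optional<String> extract(String style) {
        Matcher m = URL_IN_STYLE.matcher(style);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String style =
            "background-image: url(\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\");";
        // prints /content/dam/Image.jpg.transform/default-mobile/image.jpg
        System.out.println(extract(style).orElse("no image"));
    }
}
```

Returning an Optional keeps the "no background image" case explicit instead of falling back to the raw style string.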

Answer 1 (score: 0)

Explanation in the comments:

    Document doc = Jsoup.parse(
        "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center \" "
        + "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
        + "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
        + "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
        + "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
        + "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>");

    // select all elements with "data-src" attribute, but here we use only the first of them
    Map<String, String> dataAttributes = doc.select("[data-src]").first().dataset();

    // here we have all data attributes of this element:
    System.out.println(dataAttributes);

    // you can access them like this:
    System.out.println(dataAttributes.get("mobile-rendition"));
    System.out.println(dataAttributes.get("tablet-rendition"));
    System.out.println(dataAttributes.get("desktop-rendition"));

    // split each value on the literal ".transform" and collect the parts (contains duplicates)
    List<String> urls = dataAttributes.values().stream()
            .flatMap(value -> Stream.of(value.split("\\.transform")))
            .collect(Collectors.toList());

    // if you need only unique urls, use this one instead:
    // Set<String> urls = dataAttributes.values().stream()
    //         .flatMap(value -> Stream.of(value.split("\\.transform")))
    //         .collect(Collectors.toSet());
    System.out.println(urls);
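
A set removes duplicates but a HashSet does not keep the order in which the parts were found. If that order matters, a LinkedHashSet is one option. The sketch below uses only the standard library; the class name RenditionUrls and the method uniqueParts are hypothetical, and the input list stands in for the values returned by dataset().

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class RenditionUrls {

    // Splits every value on the literal ".transform" and keeps each part once,
    // in the order it was first seen.
    static Set<String> uniqueParts(List<String> values) {
        Set<String> parts = new LinkedHashSet<>();
        for (String value : values) {
            // Pattern.quote avoids treating "." as a regex wildcard
            for (String part : value.split(Pattern.quote(".transform"))) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
            "/content/dam/Image.jpg.transform/default-mobile/image.jpg",
            "/content/dam/Image.jpg.transform/default-mobile/image.jpg",
            "/content/dam/Image.jpg.transform/default-desktop/image.jpg");
        // prints [/content/dam/Image.jpg, /default-mobile/image.jpg, /default-desktop/image.jpg]
        System.out.println(uniqueParts(values));
    }
}
```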

Answer 2 (score: 0)

Looking closely at the div, we can see that it references two distinct images. The references are:

data-src=                  "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-mobile-rendition=     "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-tablet-rendition=     "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-desktop-rendition=    "/content/dam/Image.jpg.transform/default-desktop/image.jpg" 
style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);"

Of these five references, four point to the same (mobile) image and one points to the desktop image. So if we need to extract the URLs of these two images:

data-src=                  "/content/dam/Image.jpg.transform/default-mobile/image.jpg" 
data-desktop-rendition=    "/content/dam/Image.jpg.transform/default-desktop/image.jpg"

we can use the following code:

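        // absUrl() resolves against the document's base URI, so parse with one,
        // e.g. Jsoup.parse(html, baseUri); without it absUrl() returns an empty string
        Elements els = doc.select("div.rendition-bg");
        for (Element ele : els) {
            System.out.println(ele.absUrl("data-src"));
            System.out.println(ele.absUrl("data-desktop-rendition"));
        }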

Let me know whether I understood your requirement correctly.
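
The attribute values in the question are site-relative paths, so they have to be resolved against the page address to become absolute URLs. When the raw attribute value is read instead of an absolute one, the resolution can be done with java.net.URI. A minimal sketch, assuming a hypothetical page address on www.example.com; the class name AbsoluteImageUrl and the method resolve are also hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of jsoup.
class AbsoluteImageUrl {

    // Resolves a relative or site-relative path against the address of the page.
    static String resolve(String pageUrl, String path) {
        return URI.create(pageUrl).resolve(path).toString();
    }

    public static void main(String[] args) {
        // assumed page address, for illustration only
        String pageUrl = "https://www.example.com/products/page.html";
        String path = "/content/dam/Image.jpg.transform/default-desktop/image.jpg";
        // prints https://www.example.com/content/dam/Image.jpg.transform/default-desktop/image.jpg
        System.out.println(resolve(pageUrl, path));
    }
}
```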