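I need to extract the image from inside a div, where the source is not in an img tag. I can't use getElementById() because the id differs from page to page. In this case, can I use a regular expression to extract the image from the doc? Any help is appreciated.
HTML snippet: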
<div
class="rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center"
data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);">
</div>
Answer 0 (score: 0)
Far from an elegant or simple solution, but hopefully it gives you something to start from:
import static java.util.stream.Collectors.toList;

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

public class ImageAttributes {
    public static void main(String[] args) {
        // inner quotes of url(...) are written as &quot; so the style attribute stays well-formed
        String snippet =
            "<div class=\"rendition-bg rendition-bg--alignment desktop-center-center "
            + "mobile-center-center\" "
            + "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
            + "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
            + "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>";
        List<String> imgAttrs =
            Jsoup.parse(snippet)
                .getElementsByTag("div")
                .stream()
                // get lists of attributes
                .map(Element::attributes)
                // flatten all attrs to single list
                .flatMap(attrs -> attrs.asList().stream())
                // keep only attributes whose value mentions a .jpg
                .filter(attribute -> attribute.getValue() != null && attribute.getValue().contains(".jpg"))
                // map to values
                .map(Attribute::getValue)
                // replace all ".transform" with a whitespace
                .map(attrValue -> attrValue.replace(".transform", " "))
                // get url value of a "background-image" (other values pass through unchanged)
                .map(attrValue -> getUrlFromBackgroundImage(attrValue))
                // split attributes by whitespaces
                .flatMap(attrValue -> Stream.of(attrValue.split(" ")))
                .collect(toList());
        imgAttrs.forEach(System.out::println);
    }

    // returns the content of url(...), including any opening quote, or the input if it is not a background-image
    private static String getUrlFromBackgroundImage(final String backgroundImage) {
        Pattern pattern = Pattern.compile("background-image:[ ]?url\\((['\"]?(.*?\\.(?:png|jpg|jpeg|gif)(\\s)?)*)");
        Matcher matcher = pattern.matcher(backgroundImage);
        return matcher.find() ? matcher.group(1) : backgroundImage;
    }
}
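For the div in the question, imgAttrs should contain one pair of entries per image-bearing attribute, in attribute order (the leading quote in the second-to-last entry is captured from the quoted url(...) value of the style attribute):
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-mobile/image.jpg
/content/dam/Image.jpg
/default-desktop/image.jpg
"/content/dam/Image.jpg
/default-mobile/image.jpg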
Not sure whether that is what you need.
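Since the question asks specifically about a regular expression, here is a minimal sketch of pulling the path out of a background-image declaration with java.util.regex alone. The class and method names (BackgroundImageUrls, extract) are made up for illustration, and the pattern assumes the path itself contains no quotes or closing parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackgroundImageUrls {
    // Matches url(...) with optional single or double quotes around the path.
    // Group 1 is the path without the surrounding quotes.
    // Assumption: the path contains no quotes and no ")" character.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)");

    /** Returns every url(...) target found in a CSS declaration string. */
    public static List<String> extract(String style) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = CSS_URL.matcher(style);
        while (matcher.find()) {
            urls.add(matcher.group(1).trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        String style =
                "background-image: url(\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\");";
        // prints [/content/dam/Image.jpg.transform/default-mobile/image.jpg]
        System.out.println(extract(style));
    }
}
```

The string passed to extract would typically be the already-decoded value of the style attribute (for example what an HTML parser returns for it), so the regex never has to deal with the surrounding markup.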
Answer 1 (score: 0)
Explanation in the comments:
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.parse(
"<div class=\"rendition-bg rendition-bg--alignment desktop-center-center mobile-center-center \" "
+ "data-src=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
+ "data-mobile-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
+ "data-tablet-rendition=\"/content/dam/Image.jpg.transform/default-mobile/image.jpg\" "
+ "data-desktop-rendition=\"/content/dam/Image.jpg.transform/default-desktop/image.jpg\" "
+ "style=\"background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);\"></div>");
// select all elements with "data-src" attribute, but here we use only the first of them
Map<String, String> dataAttributes = doc.select("[data-src]").first().dataset();
// here we have all data attributes of this element:
System.out.println(dataAttributes);
// you can access them like this:
System.out.println(dataAttributes.get("mobile-rendition"));
System.out.println(dataAttributes.get("tablet-rendition"));
System.out.println(dataAttributes.get("desktop-rendition"));
// split and create list of urls (contains duplicates)
List<String> urls = dataAttributes.entrySet().stream().flatMap(e -> Stream.of(e.getValue().split("\\.transform")))
.collect(Collectors.toList());
// if you need only unique urls use this one instead:
// Set<String> urls = dataAttributes.entrySet().stream().flatMap(e -> Stream.of(e.getValue().split("\\.transform"))).collect(Collectors.toSet());
System.out.println(urls);
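String.split takes a regular expression, so an unescaped dot in ".transform" would match any character. A small standard-library sketch of the split-and-deduplicate step, using Pattern.quote to keep the separator literal and a LinkedHashSet to preserve first-seen order; the class and method names (TransformSplit, uniqueParts) are hypothetical:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class TransformSplit {
    // Pattern.quote turns ".transform" into a literal, so the dot cannot match an arbitrary character.
    private static final Pattern SEPARATOR = Pattern.compile(Pattern.quote(".transform"));

    /** Splits each value on the literal ".transform" and keeps the first occurrence of each part. */
    public static Set<String> uniqueParts(List<String> values) {
        Set<String> parts = new LinkedHashSet<>();
        for (String value : values) {
            parts.addAll(Arrays.asList(SEPARATOR.split(value)));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "/content/dam/Image.jpg.transform/default-mobile/image.jpg",
                "/content/dam/Image.jpg.transform/default-desktop/image.jpg");
        // prints [/content/dam/Image.jpg, /default-mobile/image.jpg, /default-desktop/image.jpg]
        System.out.println(uniqueParts(values));
    }
}
```

The values of the data attribute map from the answer above could be passed in as the list, for example via new ArrayList<>(dataAttributes.values()).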
Answer 2 (score: 0)
Looking closely at the div, we can see that only two distinct images are referenced. The references are:
data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-mobile-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-tablet-rendition="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
style="background-image: url(&quot;/content/dam/Image.jpg.transform/default-mobile/image.jpg&quot;);"
Of these five references, four point to the same (mobile) image and one points to the desktop image. So if we need to extract the URLs of these two images:
data-src="/content/dam/Image.jpg.transform/default-mobile/image.jpg"
data-desktop-rendition="/content/dam/Image.jpg.transform/default-desktop/image.jpg"
we can use the following code:
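// select by class instead of id, since the id differs between pages
Elements els = doc.select("div.rendition-bg");
for (Element ele : els) {
    // absUrl() returns an empty string unless the document was parsed with a base URI,
    // e.g. Jsoup.parse(html, baseUri); use ele.attr(...) to get the raw relative value
    System.out.println(ele.absUrl("data-src"));
    System.out.println(ele.absUrl("data-desktop-rendition"));
}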
Let me know whether I understood your requirement correctly.
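One caveat worth illustrating: the attribute values in the div are site-relative paths, and jsoup's absUrl can only turn them into absolute URLs when the document has a base URI. The resolution step itself can be sketched with java.net.URI from the standard library. The page URL below is a placeholder, not something from the question, and the class and method names (ResolveImageUrl, resolve) are invented for this example.

```java
import java.net.URI;

public class ResolveImageUrl {
    /** Resolves a possibly relative attribute value against the URL of the page it came from. */
    public static String resolve(String pageUrl, String attributeValue) {
        return URI.create(pageUrl).resolve(attributeValue).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, used only for illustration.
        String pageUrl = "https://example.com/products/page.html";
        // prints https://example.com/content/dam/Image.jpg.transform/default-desktop/image.jpg
        System.out.println(
                resolve(pageUrl, "/content/dam/Image.jpg.transform/default-desktop/image.jpg"));
    }
}
```

Because the paths start with a slash, only the scheme and host of the page URL end up in the result; the page's own path is discarded.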