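I have a website that I need to parse data from. I need to run some keyword searches and collect the results. However, not all fields are visible in the product preview. It seems that some fields (product color, description, old price) can only be scraped from each product's own page. A product page URL looks like this: https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077. I don't know how to request it in a generic way, so that I don't have to browse each product by hand. I can find the item's name and brand, but I don't know how to construct the URL: lowercase all the letters and put dashes between the words? I can get the brand and product name like this: NEW LOOK Basecap in Satin-Optik.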
So how do I determine the URL of each product?
Here is my code so far:
String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";

Document doc = Jsoup.connect(url).get();
System.out.println("Title: " + doc.title());

String mainPath = "section.layout_11glwo1-o_O-stretchLayout_1jug6qr > " +
        "div.content_1jug6qr > " +
        "div.container > " +
        "div.mainContent_10ejhcu > " +
        "div.productStream_6k751k > " +
        "div > " +
        "div.wrapper_8yay2a > " +
        "div.col-sm-6.col-md-4 > " +
        "div.wrapper_1eu800j > " +
        "div > " +
        "div.categoryTileWrapper_e296pg";

String searchPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio";

String linksPath = mainPath + " > a.anchor_wgmchy";

String brandPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio > " +
        "div.description_ya0ltb > " +
        "strong.brand_ke66rm";

Elements result = doc.body().select("main#app");
for (Element element : result) {
    Elements products = element.select(searchPath);
    Elements links = element.select(linksPath);
    Elements brands = element.select(brandPath);

    for (Element product : products) {
        System.out.println(product.text());
    }

    String[] linksText = null;
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        linksText = linkHref.split("[\\-]");
        String id = linksText[linksText.length - 1];
        System.out.println("id: " + id);
        System.out.print("link attr:" + linkHref + ", ");
    }
    System.out.print("\nbrands" + brands.text());
}
Maybe there is a library for this? I would appreciate any advice!
Answer 0 (score: 0)
Most of the required details can be taken from divs like this one:
<div class="details_..." ...>
Grabbing the text of these divs gives you something like:
-10%9,90€ -10 % EXTRA8,90€ NEW LOOK Basecap in Satin-Optik 8,01€
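If you only have that concatenated text, the prices can still be pulled out of it with a regular expression. This is just a minimal sketch using the standard library: the class name `PriceTextParser` and the method `extractPrices` are made up for illustration, and it assumes German-style amounts (decimal comma, optional dot as thousands separator) followed by a euro sign.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the site's markup.
final class PriceTextParser {

    // Assumed format: amounts such as 9,90 or 1.299,00 followed by the euro sign,
    // written here as a unicode escape so the source file stays pure ASCII.
    private static final Pattern PRICE =
            Pattern.compile("(\\d+(?:\\.\\d{3})*,\\d{2})\\s*\u20AC");

    /** Returns every price found in the text, in order of appearance. */
    static List<BigDecimal> extractPrices(String text) {
        List<BigDecimal> prices = new ArrayList<>();
        Matcher matcher = PRICE.matcher(text);
        while (matcher.find()) {
            // Drop thousands separators, then turn the decimal comma into a dot.
            String normalized = matcher.group(1).replace(".", "").replace(',', '.');
            prices.add(new BigDecimal(normalized));
        }
        return prices;
    }

    public static void main(String[] args) {
        // The sample text from the details div shown above.
        String text = "-10%9,90\u20AC -10 % EXTRA8,90\u20AC NEW LOOK Basecap in Satin-Optik 8,01\u20AC";
        List<BigDecimal> prices = extractPrices(text);
        System.out.println("all prices:  " + prices);
        System.out.println("final price: " + prices.get(prices.size() - 1));
    }
}
```

For the sample text this finds 9.90, 8.90 and 8.01, and the discount percentages are skipped because they have no decimal comma. Selecting the individual price elements is sturdier whenever the markup allows it, since a regex over flattened text breaks as soon as the wording changes.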
Example code that separates out some of the details and makes a sub-request to the product page for the color details:
// Imports needed: java.io.IOException, org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element, org.jsoup.select.Elements
String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
String userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36";

try {
    // Load the category listing page.
    Document doc = Jsoup.connect(url).userAgent(userAgent).get();
    // The class names appear to carry generated suffixes, so match on the prefix only.
    Elements elements = doc.select("div[class^='categoryTileWrapper_']");
    for (Element element : elements) {
        String brand = element.select("strong[class^='brand_']").first().text();
        String name = element.select("p[class^='name_']").first().text();
        System.out.println(brand + " - " + name);

        // The anchor already holds the product link; absUrl returns it as an absolute URL.
        String href = element.select("a[class^='anchor_']").first().absUrl("href");
        // Sub-request to the product page for the color details.
        Document subDoc = Jsoup.connect(href).userAgent(userAgent).get();
        String color = subDoc.select("div[class^='attributeWrapper_']").first().text();
        System.out.println("\t" + href);
        System.out.println("\t" + color);

        String finalPrice = element.select("div[class^='finalPrice_']").first().text();
        // Earlier prices, if there are any, are read from the first <ul> in the tile.
        if (element.select("ul").size() > 0) {
            for (Element listItem : element.select("ul").first().select("li")) {
                System.out.println("\tprice was: " + listItem.select("span[class^='price_']").first().text());
            }
        }
        System.out.println("\tfinal price: " + finalPrice);
    }
} catch (IOException e) {
    e.printStackTrace();
}
Output:
NEW LOOK - Basecap in Satin-Optik
    https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077
    Textil Unifarben
    price was: 9,90€
    price was: 8,90€
    final price: 8,01€
WOOD WOOD - Weiche 'Baseball cap'
    https://www.aboutyou.de/p/wood-wood/weiche-baseball-cap-3687779
    Logoprint
    price was: 39,90€
    price was: 29,90€
    final price: 20,93€
[... truncated]
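On the original question of how to build the product URL: it does not have to be reconstructed from the brand and name, because each tile's anchor already carries it in its `href`. As a rough illustration of what resolving such a link against the listing URL amounts to, here is a small standard-library sketch. The class `ProductLinks`, both method names, and the relative `href` value are assumptions made for the example; the attribute on the real page may be relative or already absolute.

```java
import java.net.URI;

// Hypothetical helper for illustration; uses only java.net.URI.
final class ProductLinks {

    /** Resolves an href taken from the listing page against that page's URL. */
    static String toAbsoluteUrl(String pageUrl, String href) {
        // An already absolute href is returned unchanged by resolve().
        return URI.create(pageUrl).resolve(href).toString();
    }

    /**
     * Returns the part after the last dash, which is the numeric id in the
     * product URLs shown above. Without any dash the whole string is returned.
     */
    static String productId(String productUrl) {
        return productUrl.substring(productUrl.lastIndexOf('-') + 1);
    }

    public static void main(String[] args) {
        String pageUrl = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
        // Assumed href value as it might appear on the anchor element.
        String href = "/p/new-look/basecap-in-satin-optik-3649077";

        String productUrl = toAbsoluteUrl(pageUrl, href);
        System.out.println(productUrl);
        System.out.println("id: " + productId(productUrl));
    }
}
```

Taking the substring after the last dash does the same job as splitting on every dash and picking the last element, without allocating the intermediate array.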