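I'm new to jsoup and want to get more familiar with extracting information from websites. I'd like to do something simple: get some values from eBay.
I want to get the item name, HTML link, price and number sold from "Hot this week" (for example here: http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html),
but I'm not sure how to proceed.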
package application;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import javax.swing.JOptionPane;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetHotSellers {

    public static void main(String[] args) {
        Document doc = Jsoup.parse(readURL("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html"));
        // Matches every element whose text (including the text of its descendants) ends with "sold"
        Elements sold_items = doc.getElementsMatchingText("sold$");
        for (Element sold : sold_items) {
            System.out.println(sold.text());
        }
    }

    // Downloads the page at the given URL and returns its contents as a single string
    public static String readURL(String url) {
        String fileContents = "";
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
            String currentLine;
            // Stop at end of stream so the literal "null" is never appended
            while ((currentLine = reader.readLine()) != null) {
                fileContents += currentLine + "\n";
            }
            reader.close();
        } catch (Exception e) {
            JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.OK_OPTION);
            e.printStackTrace();
        }
        return fileContents;
    }
}
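That's what I have so far. Do I need to improve my regular expression, or should I use a different function that is better suited to what I want?
My current output looks like this: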
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
381 sold
381 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
187 sold
187 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
174 sold
174 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
129 sold
129 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
101 sold
101 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
89 sold
89 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
88 sold
88 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
87 sold
87 sold
Example of the output I want:
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay || £7.99 || 87 sold || http://link.com
Edit:
I tried something like this, but no luck.
for (String categoryURL : categoryLinksArray) {
    Document doc = Jsoup.parse(readURL(categoryURL));
    Elements sold_items = doc.getElementsByClass("b-block-info-container");
    for (Element sold : sold_items) {
        System.out.println("NAME: " + sold.attr("b-block-info-container__title b-block-info-container__title__ListingSummary") + "\n" +
                "PRICE: " + sold.attr("b-block-info-container__price") + "\n" +
                "SOLD/week: " + sold.attr("item_quantity__hotness") + "\n" +
                "URL: " + sold.attr("abs:href"));
        System.out.println("--------------------------------------");
    }
}
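If only the flattened text lines from the output above are available, a regular expression can still split a single-item line into name, price and sold count. The sketch below is a minimal illustration, assuming each line ends with a pound price followed by "<n> sold"; the class ListingLineParser and its format method are hypothetical names and not part of jsoup. It cannot recover the link, which only exists in the HTML.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListingLineParser {
    // Assumed line shape: "<name> <pound sign><price> <count> sold".
    // \u00A3 is the pound sign; the name may not contain one, so lines that
    // hold several concatenated items are rejected instead of mis-split.
    private static final Pattern LINE = Pattern.compile(
            "^([^\u00A3]+?)\\s+(\u00A3\\d+(?:\\.\\d{2})?)\\s+(\\d+ sold)$");

    // Returns "name || price || n sold", or empty if the line has another shape
    static Optional<String> format(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + " || " + m.group(2) + " || " + m.group(3));
    }

    public static void main(String[] args) {
        String line = "Henry Hugglemonster Huggle House Playset. "
                + "From the Official Argos Shop on ebay \u00A37.99 87 sold";
        System.out.println(format(line).orElse("no match"));
    }
}
```

Lines such as "381 sold" or the long concatenated first line do not match and come back empty, so this only filters and splits text; selecting the right elements in the first place remains the cleaner fix.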
Answer 0 (score: 1)
I got it working, but it isn't efficient: it is slow.
import java.util.ArrayList;
import java.util.HashMap;

// Uses readURL and the jsoup imports from the question
public static void main(String[] args) {
    // Collect the link of every category from the all-categories page
    ArrayList<String> categoryLinksArray = new ArrayList<>();
    Document links = Jsoup.parse(readURL("http://www.ebay.co.uk/sch/allcategories/all-categories"));
    Elements item_categories = links.getElementsByClass("ch");
    for (Element category : item_categories) {
        categoryLinksArray.add(category.attr("abs:href"));
    }
    // Download each category page in turn and look for its top-sold module
    for (String categoryURL : categoryLinksArray) {
        Document doc = Jsoup.parse(readURL(categoryURL));
        Elements hot_items = doc
                .getElementsByClass("b-module b-module-carousel b-module-deals topSold b-display--portrait");
        for (Element item : hot_items) {
            Elements hot_items_names = item.getElementsByClass(
                    "b-block-info-container__title b-block-info-container__title__ListingSummary");
            Elements hot_items_price = item.getElementsByClass("b-block-info-container__price");
            Elements hot_items_sold = item.getElementsByClass("item_quantity__hotness");
            Elements hot_items_url = item.getElementsByClass("b-block-tile");
            // Each put overwrites the previous value, so only the last match per key is kept
            HashMap<String, String> hs_items = new HashMap<>();
            for (Element item_name : hot_items_names) {
                hs_items.put("Name", item_name.text());
            }
            for (Element item_price : hot_items_price) {
                hs_items.put("Price", item_price.text());
            }
            for (Element item_sold : hot_items_sold) {
                hs_items.put("Sold", item_sold.text());
            }
            for (Element item_url : hot_items_url) {
                hs_items.put("URL", item_url.attr("abs:href"));
            }
            System.out.println("Name: " + hs_items.get("Name") + "\n" +
                    "Price: " + hs_items.get("Price") + "\n" +
                    "Sold: " + hs_items.get("Sold") + "\n" +
                    "URL: " + hs_items.get("URL") + "\n" +
                    "----------------------------------");
        }
    }
}
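Besides issuing one HTTP request per category, one possible contributor to the slowness is readURL from the question, which grows the page text with repeated String concatenation and so copies the accumulated text on every line. A minimal sketch of the same line-by-line read with a StringBuilder follows; PageReader and readAll are hypothetical names, and taking a Reader as the parameter is an assumption made so the helper can be tried without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class PageReader {
    // Reads every line from the source and joins them with '\n'
    static String readAll(Reader source) {
        StringBuilder contents = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine returns null at end of stream, which ends the loop
            while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        // In-memory stand-in for a downloaded page
        System.out.print(readAll(new StringReader("<html>\n<body></body>\n</html>")));
    }
}
```

For a real page the reader could be new InputStreamReader(new URL(url).openStream()), as in the question. Jsoup.connect(url).get(), which another answer here uses, avoids the manual read altogether.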
Answer 1 (score: 0)
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String argv[]) throws IOException {
        Document doc = Jsoup.connect("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html").get(); // connect to url and get the document
        Element hotThisWeek = doc.getElementById("w6-2-x-carousel-items"); // select the div by its ID // better than matching text because id is unique
        Elements items = hotThisWeek.select("li"); // select all li tags
        for (Element e : items) {
            System.out.println(e.select("div.b-block-info-container__title").text() // select the div with title text by class name
                    + " || " + e.select("div.b-block-info-container__price").text() // select the price-div by its class name
                    + " || " + e.select("div.item_quantity__hotness").text() // select hotness-div by class name
                    + " || " + e.select("a").attr("href")); // select a tag and get value of attribute href
        }
    }
}
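Answer 2 (score: 0)
The page is organized into sections. Each section tag has an id, starting with id="w2", then id="w3", and so on up to id="w10". You can use these ids to iterate over each section and select the data you are interested in.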
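The section-by-section idea can be sketched as below. This is a minimal illustration and not the answerer's own code: SectionWalker, sectionIds and collect are hypothetical names, and the lookup function stands in for the parsed page so the sketch runs without jsoup or a network connection. With jsoup, the lookup would presumably be something like id -> Optional.ofNullable(doc.getElementById(id)).map(Element::text), since getElementById returns null when no element has that id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class SectionWalker {
    // Builds the ids "w<first>" .. "w<last>", e.g. "w2", "w3", ..., "w10"
    static List<String> sectionIds(int first, int last) {
        List<String> ids = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            ids.add("w" + i);
        }
        return ids;
    }

    // Looks up each id in order and keeps only the sections that exist
    static Map<String, String> collect(List<String> ids,
                                       Function<String, Optional<String>> lookup) {
        Map<String, String> found = new LinkedHashMap<>();
        for (String id : ids) {
            lookup.apply(id).ifPresent(text -> found.put(id, text));
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for a parsed page: only two of the nine sections exist here
        Map<String, String> fakePage = new HashMap<>();
        fakePage.put("w2", "first section text");
        fakePage.put("w5", "another section text");

        Map<String, String> sections =
                collect(sectionIds(2, 10), id -> Optional.ofNullable(fakePage.get(id)));
        sections.forEach((id, text) -> System.out.println(id + " -> " + text));
    }
}
```

Skipping missing ids matters because not every category page necessarily contains all of the sections.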