jsoup: getting specific tags and the values associated with them

Asked: 2016-11-28 21:13:41

Tags: java regex jsoup

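I'm new to jsoup and want to get more familiar with extracting information from websites. I'd like to do something simple: pull a few values from eBay.

I want the item name, HTML link, price, and number sold for the items under "Hot this week" (for example here: http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html).

But I'm not sure how to proceed.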

package application;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import javax.swing.JOptionPane;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetHotSellers {

    public static void main(String[] args) {
        Document doc =  Jsoup.parse(readURL("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html"));

        Elements sold_items = doc.getElementsMatchingText("sold$");   
        for(Element sold : sold_items) {
                System.out.println(sold.text());
        }
    }


     public static String readURL(String url) {

     StringBuilder fileContents = new StringBuilder();

     // try-with-resources closes the reader even if reading fails
     try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
         String currentLine;
         // Stop at end of stream, so the final null is never appended to the result
         while ((currentLine = reader.readLine()) != null) {
             fileContents.append(currentLine).append("\n");
         }
     } catch (Exception e) {
         JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.ERROR_MESSAGE);
         e.printStackTrace();
     }

     return fileContents.toString();
    }

}

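This is what I have so far. Do I need to improve my regex, or should I use a different function that suits my request better?

My current output is as follows: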

2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
381 sold
381 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
187 sold
187 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
174 sold
174 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
129 sold
129 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
101 sold
101 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
89 sold
89 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
88 sold
88 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
87 sold
87 sold

Example of the output I want:

Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay || £7.99 || 87 sold || http://link.com
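
If you stay with the text-matching approach, one way to turn the single-item lines above into this format is a stricter regex. This is only a sketch: it assumes every useful line has the shape "<name> £<price> <count> sold", as in the output shown, and the class name SoldLineParser and the method format are made up for illustration. The link is not part of the matched text, so it has to be passed in separately (for instance from the element's href).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SoldLineParser {

    // Assumed line shape, taken from the output above: "<name> £<price> <count> sold".
    // \u00A3 is the pound sign. The name may not contain a pound sign, so the long
    // lines that concatenate several items are rejected instead of mis-parsed.
    private static final Pattern SOLD_LINE =
            Pattern.compile("^([^\u00A3]+?)\\s+(\u00A3[\\d,.]+)\\s+(\\d+ sold)$");

    /** Returns "name || price || sold || link", or null if the line does not match. */
    public static String format(String line, String link) {
        Matcher m = SOLD_LINE.matcher(line.trim());
        if (!m.matches()) {
            return null; // e.g. a bare "87 sold" line or a multi-item line
        }
        return m.group(1) + " || " + m.group(2) + " || " + m.group(3) + " || " + link;
    }

    public static void main(String[] args) {
        String line = "Henry Hugglemonster Huggle House Playset. "
                + "From the Official Argos Shop on ebay \u00A37.99 87 sold";
        System.out.println(format(line, "http://link.com"));
    }
}
```

This does not remove the duplicates: the same item text is returned once for every nested element that contains it, so the results would still need to be collected into a set, or the elements selected by class or id instead of by text.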

Edit:

I tried something like this, but no luck.

for(String categoryURL : categoryLinksArray) {
    Document doc = Jsoup.parse(readURL(categoryURL));
    Elements sold_items = doc.getElementsByClass("b-block-info-container");
    for(Element sold : sold_items) {
            System.out.println("NAME: " + sold.attr("b-block-info-container__title b-block-info-container__title__ListingSummary") + "\n" + 
                               "PRICE: " + sold.attr("b-block-info-container__price") + "\n" +
                               "SOLD/week: " + sold.attr("item_quantity__hotness") + "\n" +
                               "URL: " + sold.attr("abs:href"));
            System.out.println("--------------------------------------");
    }
}

3 Answers:

Answer 0: (score: 1)

I got it working, but it's not efficient: it is slow.

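// Needs these in addition to the jsoup imports; readURL(...) is the helper from the question
import java.util.ArrayList;
import java.util.HashMap;

public static void main(String[] args) {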

    ArrayList<String> categoryLinksArray = new ArrayList<>();

    Document links = Jsoup.parse(readURL("http://www.ebay.co.uk/sch/allcategories/all-categories"));
    Elements item_categories = links.getElementsByClass("ch");
    for (Element category : item_categories) {
        categoryLinksArray.add(category.attr("abs:href"));
    }

    for (String categoryURL : categoryLinksArray) {
        Document doc = Jsoup.parse(readURL(categoryURL));
        Elements hot_items = doc
                .getElementsByClass("b-module b-module-carousel b-module-deals topSold b-display--portrait");
        for (Element item : hot_items) {

            Elements hot_items_names = item.getElementsByClass(
                    "b-block-info-container__title b-block-info-container__title__ListingSummary");
            Elements hot_items_price = item.getElementsByClass("b-block-info-container__price");
            Elements hot_items_sold = item.getElementsByClass("item_quantity__hotness");
            Elements hot_items_url = item.getElementsByClass("b-block-tile");

            HashMap<String, String> hs_items = new HashMap<>();

            for (Element item_name : hot_items_names) {
                hs_items.put("Name", item_name.text());
            }
            for (Element item_price : hot_items_price) {
                hs_items.put("Price", item_price.text());
            }
            for (Element item_sold : hot_items_sold) {
                hs_items.put("Sold", item_sold.text());
            }
            for (Element item_url : hot_items_url) {
                hs_items.put("URL", item_url.attr("abs:href"));
            }

            System.out.println("Name: " + hs_items.get("Name") + "\n" +
                               "Price: " + hs_items.get("Price") + "\n" +
                               "Sold: " + hs_items.get("Sold") + "\n" +
                               "URL: " + hs_items.get("URL") + "\n" +
                               "----------------------------------");
        }
    }
}

Answer 1: (score: 0)

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String argv[]) throws IOException {            
        Document doc = Jsoup.connect("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html").get(); //connect to url and get the document
        Element hotThisWeek = doc.getElementById("w6-2-x-carousel-items"); // select the div by its ID // better than matching text because id is unique
        Elements items = hotThisWeek.select("li");    // select all li tags        
        for(Element e : items){
            System.out.println(  e.select("div.b-block-info-container__title").text() // select the div with title text by class name
                     + " || " +  e.select("div.b-block-info-container__price").text()  // select the price-div by its class name
                     + " || " +  e.select("div.item_quantity__hotness").text()  // select hotness-div by class name 
                     + " || " +  e.select("a").attr("href")); //select a tag and get value of attribute href 
        }
    } 
}

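Answer 2: (score: 0)

The page is organized into sections. Each of those section tags has an id, running from id="w2", id="w3", ... up to id="w10". You can use these ids to walk through each section and select the data you care about. Example: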
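
A minimal sketch of that idea, run against a small inline HTML fragment instead of the live page. The class names are the ones used in the other answers; the fragment itself, the class name SectionWalker, and the method collect are assumptions for illustration, and the real eBay markup may differ or change.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SectionWalker {

    /** Collects "title || price || sold || link" for every <li> found in sections w2..w10. */
    public static List<String> collect(Document doc) {
        List<String> rows = new ArrayList<>();
        for (int i = 2; i <= 10; i++) {
            Element section = doc.getElementById("w" + i);
            if (section == null) {
                continue; // this section id is not present on the page
            }
            for (Element li : section.select("li")) {
                String title = li.select(".b-block-info-container__title").text();
                if (title.isEmpty()) {
                    continue; // not an item tile
                }
                rows.add(title
                        + " || " + li.select(".b-block-info-container__price").text()
                        + " || " + li.select(".item_quantity__hotness").text()
                        + " || " + li.select("a").attr("abs:href"));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical fragment that mimics one section of the page
        String html = "<div id=\"w2\"><ul><li>"
                + "<a href=\"/itm/1\">view</a>"
                + "<div class=\"b-block-info-container__title\">Sample item</div>"
                + "<div class=\"b-block-info-container__price\">&pound;7.99</div>"
                + "<div class=\"item_quantity__hotness\">87 sold</div>"
                + "</li></ul></div>";
        // The base URI lets "abs:href" resolve relative links
        Document doc = Jsoup.parse(html, "http://www.ebay.co.uk/");
        for (String row : collect(doc)) {
            System.out.println(row);
        }
    }
}
```

For the live page, the Document would come from Jsoup.connect(url).get() as in the previous answer, which also sets the base URI automatically.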
