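How do I retrieve the URL from a link on a website using Jsoup?

Asked: 2016-11-23 16:26:05

Tags: java html url jsoup yelp

OK, so I finished my Yelp scraper and everything works well. What I want to do now is have the program retrieve the URL behind each business's link, go to that page, and scan whether it contains: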



xlink:href="#30x30_bullhorn"></use>

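I have a pretty clear idea of how I'm going to go about this, but I can't seem to find a jSoup method that retrieves a link's URL. Is the URL present somewhere in the page's HTML? I'm not familiar with HTML at all, so 90% of what I'm looking at is gibberish. Here is an example link if you want to see what I'm referring to.

https://www.yelp.com/search?find_loc=nj&start=10 is the main page, and I need to get the URL of the page https://www.yelp.com/biz/la-cocina-newark. The orange bullhorn is what I'm trying to get the program to find. Here's my code, by the way: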

import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Scanner;

public class YelpScrapper
{
    public static void main(String[] args) throws IOException
    {        
        //Variables
        String description;
        String location;
        int pages;
        int parseCount = 0;
        Document document;

        Scanner keyboard = new Scanner(System.in);

        //Perform a Search
        System.out.print("Enter a description: ");
        description = keyboard.nextLine();

        System.out.print("Enter a state: ");
        location = keyboard.nextLine();

        System.out.print("How many pages should we scan? ");
        pages = keyboard.nextInt();

        String descString = "find_desc=" + description.replace(' ', '+') + "&";
        String locString = "find_loc=" + location.replace(' ', '+') + "&";
        int number = 0;

        String url = "https://www.yelp.com/search?" + descString + locString + "start=" + number;
        ArrayList<String> names = new ArrayList<String>();
        ArrayList<String> address = new ArrayList<String>();
        ArrayList<String> phone = new ArrayList<String>();

        //Fetch Data From Yelp
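        for (int i = 0 ; i < pages ; i++)
        {
            //Rebuild the URL on every pass so the "start" offset actually advances
            url = "https://www.yelp.com/search?" + descString + locString + "start=" + number;
            document = Jsoup.connect(url).get();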

            Elements nameElements = document.select(".indexed-biz-name span");
            Elements addressElements = document.select(".secondary-attributes address");
            Elements phoneElements = document.select(".biz-phone");

            for (Element element : nameElements)
            {
                names.add(element.text());
            }

            for (Element element : addressElements)
            {
                address.add(element.text());
            }

            for (Element element : phoneElements)
            {
                phone.add(element.text());
            }

            //Print only leads that have all three fields, rather than assuming 10 per page
            int available = Math.min(names.size(), Math.min(address.size(), phone.size()));
            while (parseCount < available)
            {
                System.out.println("\nLead " + parseCount);
                System.out.println("Company Name: " + names.get(parseCount));
                System.out.println("Address: " + address.get(parseCount));
                System.out.println("Phone Number: " + phone.get(parseCount));

                parseCount = parseCount + 1;
            }

            number = number + 10;

        }
    }
}

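1 Answer:

Answer 0 (score: 0)

Learn how to use Inspect Element in Chrome Developer Tools, because it makes finding elements in the DOM very easy (you said you aren't comfortable with HTML; Inspect is a great learning tool and you'll certainly keep using it after this). Focus the inspector on the "View Now" button and you will see: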

<a href="https://www.yelp.com/biz_redir?cachebuster=1479918865&amp;s=1c73b4bdc9110f6e6dc72fff48cd6379d6eaac0cd6d15794a9414e546ad5a927&amp;src_bizid=U2eO8yFSc9YTf_SPnog8cw&amp;url=http%3A%2F%2Fwww.lacocinanewark.com%2F%23%21menu%2Fcl69&amp;website_link_type=cta" rel="nofollow" target="_blank" class="ybtn ybtn--primary ybtn--small ybtn-cta" data-component-bound="true">View Now</a>
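Note that the href in this snippet is a Yelp redirect (biz_redir), and the business's own site appears percent-encoded in its url query parameter. If that target is what you want, here is a minimal sketch using only the JDK. The class name RedirectTarget and the method extractTarget are made up for illustration, the href in main is a shortened copy of the one above, and URLDecoder.decode(String, Charset) assumes Java 10 or newer. Jsoup's attr("href") already turns &amp; back into &, so the input here uses plain &.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jsoup or Yelp's API.
class RedirectTarget {

    // Returns the decoded value of the "url" query parameter, or null if it is absent.
    static String extractTarget(String href) {
        String query = URI.create(href).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("url")) {
                // Decode only the value, after splitting, so encoded '&' or '=' stay intact.
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Shortened version of the href shown above (the "s" and "src_bizid" parameters are omitted).
        String href = "https://www.yelp.com/biz_redir?cachebuster=1479918865"
                + "&url=http%3A%2F%2Fwww.lacocinanewark.com%2F%23%21menu%2Fcl69"
                + "&website_link_type=cta";
        System.out.println(extractTarget(href)); // prints http://www.lacocinanewark.com/#!menu/cl69
    }
}
```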

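You'll have to work out how to traverse to this <a> element; childNodes() can help with the traversal. You can then use getElementsByClass("ybtn-cta") to reach the link by its class (the method takes a single class name rather than the whole space-separated class attribute), and then call the Element class's .attr() method to get the href: .attr("href");.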
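To make that concrete, here is a rough sketch of pulling the href out with Jsoup. It parses an inline HTML string instead of fetching Yelp, so the markup, the class name LinkHrefSketch, and the method hrefsByClass are all assumptions for illustration; the real page structure may differ and may have changed. It uses select() with a CSS selector in place of manual traversal, and absUrl("href") so that relative links are resolved against the base URI.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper for illustration only.
class LinkHrefSketch {

    // Collects the absolute URL of every <a> that has the given CSS class and an href.
    static List<String> hrefsByClass(String html, String baseUri, String cssClass) {
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("a." + cssClass + "[href]");
        List<String> urls = new ArrayList<>();
        for (Element link : links) {
            // absUrl resolves relative hrefs against baseUri; attr("href") would return the raw value.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up markup that only mimics the button shown above.
        String html = "<div><a class=\"ybtn ybtn--primary ybtn--small ybtn-cta\" "
                + "href=\"/biz_redir?website_link_type=cta\">View Now</a></div>";
        for (String url : hrefsByClass(html, "https://www.yelp.com/", "ybtn-cta")) {
            System.out.println(url);
        }
    }
}
```

With a live page you would pass the Document returned by Jsoup.connect(url).get() through the same select() call. The selector for the business links on the search results page has to be read off the page with Inspect, as described above.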