Question

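I'm new to web scraping with Java (I believe that's the correct term) and have been trying to find a good tutorial on what I'm attempting:

I'd like to have a class in the program I've created that scans all of the data from a given website and stores it. I could then use that data in my Main class.

I'm asking for someone to point me in the right direction on what I should be searching for, or for someone to explain how I would program this.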

Answer 1

Okay, I'll try to answer this better than the other answer did. First, let me say that if you aren't familiar with DOM parsing or any other kind of document parsing, you will probably find this quite difficult.

The first thing you're going to need to do is turn the HTML into a document. Using jsoup you can do this with:

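// Needs the jsoup library: org.jsoup.Jsoup and org.jsoup.nodes.Document
Document doc = Jsoup.connect("http://example.com")
        .data("query", "Java")    // request parameter
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)            // milliseconds
        .post();                  // sends a POST; use .get() to simply fetch a page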

Now you have a document called "doc". It has the same structure as the HTML it was built from. To "parse" this document you are going to have to do some serious navigation; unfortunately, there is no magical "parse the entire document" call. (The same goes for parsing XML. Trust me, I just had to parse an XML file with over 100 nodes and it was time-consuming.)

So to navigate it, it helps a lot to understand the structure of the HTML. Consider printing "doc" (for example with System.out.println(doc)) so you can see what the HTML actually looks like before you go any further.

Once you know your tag names and ids, you can use a wide variety of methods, such as:

getElementById(String id)

Of course, you could then save the element's contents to a String.

You're going to need to use loops and ArrayLists in situations where there are multiple tags with the same name.

I'm not going to go much further into the methods, because you're really just going to have to practice. I know that when I used a DOM parser on XML I called getTextContent(), but I'm not sure if that applies here.

Here is an example of how I used the DOM parser to parse an XML file (note that I used XPath to navigate my document, which may be different from how you do it):
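As a rough illustration of the lookup-and-loop idea, here is a minimal sketch, assuming the jsoup library is on the classpath. The class name PageData, the id "title", and the sample HTML are all made up for the example. It parses an HTML string rather than fetching a page, so it runs without a network connection.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical holder class: parses some HTML once and keeps the extracted data.
final class PageData {
    final String title;
    final List<String> links = new ArrayList<>();

    PageData(String html) {
        // Jsoup.parse works on a String, so nothing is downloaded here.
        Document doc = Jsoup.parse(html);

        // Single element looked up by its id attribute; null if it is absent.
        Element heading = doc.getElementById("title");
        title = (heading == null) ? "" : heading.text();

        // Many elements of the same kind: loop over them and collect into a list.
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("href"));
        }
    }

    public static void main(String[] args) {
        // Made-up sample HTML, standing in for a downloaded page.
        String html = "<html><body><h1 id=\"title\">Hello</h1>"
                + "<a href=\"/one\">One</a><a href=\"/two\">Two</a></body></html>";
        PageData data = new PageData(html);
        System.out.println(data.title);
        System.out.println(data.links);
    }
}
```

A Main class could then create a PageData and read its fields. In jsoup, text() gives an element's text without tags, and outerHtml() gives the element as an HTML string.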

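// Compile the XPath expression that locates every Nb element.
XPathExpression RfrdDocInfNbexpr = xpath.compile("//Ntfctn/Ntry/NtryDtls/TxDtls/RmtInf/Strd/RfrdDocInf/Nb");
// Evaluate it against the parsed document; NODESET asks for all matches.
Object RfrdDocInfNb = RfrdDocInfNbexpr.evaluate(doc, XPathConstants.NODESET);
NodeList nodesRfrdDocInfNb = (NodeList) RfrdDocInfNb;
for (int i = 0; i < nodesRfrdDocInfNb.getLength(); i++) {
    Element RfrdDocInfNbel = (Element) nodesRfrdDocInfNb.item(i);
    // Utilities.xmlToString (a helper not shown here) serializes the element to a String.
    RfrdDocInfNbS = Utilities.xmlToString(RfrdDocInfNbel);
    // Cut off the first 42 and the last 5 characters of the serialized String.
    int length = RfrdDocInfNbS.length();
    RfrdDocInfNbS = RfrdDocInfNbS.substring(42, length);
    length = RfrdDocInfNbS.length();
    RfrdDocInfNbS = RfrdDocInfNbS.substring(0, length - 5);
    // RfrdDocInfNbS (a String) and RfrdDocInfNbAL (an ArrayList) are declared elsewhere.
    RfrdDocInfNbAL.add(RfrdDocInfNbS);
}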

So what did I do there?

XPathExpression RfrdDocInfNbexpr = xpath.compile("//Ntfctn/Ntry/NtryDtls/TxDtls/RmtInf/Strd/RfrdDocInf/Nb");

Sets the path of the element (also called a node) that I want to extract the value from.

Object RfrdDocInfNb = RfrdDocInfNbexpr.evaluate(doc, XPathConstants.NODESET);

Then evaluates that expression against the document and stores the result as an Object.

NodeList nodesRfrdDocInfNb = (NodeList) RfrdDocInfNb;

Casts that result to a NodeList, a list of all the matching nodes. (There may be multiple tags with the same name; in fact, in my XML there were 60 of each tag.)

Element RfrdDocInfNbel = (Element) nodesRfrdDocInfNb.item(i);

Turns my node into an element. Since you're using HTML, you may be able to just BEGIN at this part: getting an element is your objective.

RfrdDocInfNbS = Utilities.xmlToString(RfrdDocInfNbel);

This is important! This is how I turned an element into a String, and I had a lot of trouble with this part. Since you're using HTML, this exact call won't work, but the point is that you will have to figure out how to turn an HTML element into a String.

So that is how I used a parser to go through my XML and extract everything into ArrayLists and Strings. I had many blocks of code like that.
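For comparison, here is a self-contained sketch of the same XPath loop using only the standard JDK XML classes. It reads each value with getTextContent() instead of serializing the element and trimming it with substring. The class name XPathTextSketch and the sample XML are made up for illustration; only the element names RfrdDocInf and Nb are borrowed from the path above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class, not part of the original project.
final class XPathTextSketch {

    // Returns the text of every node matched by the XPath expression.
    static List<String> extract(String xml, String expression) throws Exception {
        // Parse the XML string into a DOM document.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Evaluate the expression; NODESET asks for all matching nodes.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() yields the text inside the element, without the tags.
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data for the demonstration.
        String xml = "<Strd><RfrdDocInf><Nb>INV-1</Nb></RfrdDocInf>"
                + "<RfrdDocInf><Nb>INV-2</Nb></RfrdDocInf></Strd>";
        System.out.println(extract(xml, "//RfrdDocInf/Nb"));
    }
}
```

Whether this fits depends on the data: getTextContent() also concatenates the text of any child elements, so it is most useful on leaf elements like the ones in this sample.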

If you REALLY want to undertake this project, I suggest doing some research on the jsoup website, starting with the DOM navigation page of the cookbook: http://jsoup.org/cookbook/extracting-data/dom-navigation

And again, this is an advanced project, so don't expect to understand it in a day. I would expect it to take at least a week of reading and practice, unless you are already familiar with parsing.
