Question

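Sorry if this is a very broad question. I'm just looking for someone to point me in the right direction to learn more, because I don't know where to start. I have only a very basic knowledge of HTML and Java.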

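Someone in my family has to copy every product from a supplier into his own web shop. The problem is that he has to enter every article by hand, one at a time, and I'm looking for a way to automate that for him.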

I already have the price calculation more or less worked out; all I need now is the product information.

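It runs from line 1009 to around 1030. I need the 3 separate strings from the three spans with the "CatalogusListDetailTest" class, from line 987 to around 1000. I also need a way to get all the images, which are located at www.flamingo.be/Images/Products/Large/"productID"(our first string).jpg. Sometimes there is an _A, _B suffix, as you can see in this example, so I'm looking for a way to check whether those exist and to fetch those images as well.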

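If I can get this working I'll be very grateful! I'll figure out the rest myself. Sorry for the long post; I wanted to give as much information as possible.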

Answer 1

You could take a look at the HTML parser library Jsoup. Documentation: http://jsoup.org/cookbook/

Edit: code to get the product codes:

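    // Needs: import org.jsoup.nodes.Element; import org.jsoup.select.Elements;
    // `document` is an already parsed org.jsoup.nodes.Document.
    Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
    for (Element classElement : classElements) {
        // The matched element holds the label; the code is the parent's own text.
        if (classElement.text().contains("Productcode :")) {
            System.out.println(classElement.parent().ownText());
        }
    }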

Instead of document you may have to use a specific element to get consistent results. The code above prints all product codes.

Answer 2

You can use JTidy for this.

Code example:

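    import java.io.*;
    import java.net.URL;
    import javax.xml.xpath.*;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
        URL url = new URL(pageLink);
        Document response;
        // Close the page stream once JTidy has built the DOM.
        try (BufferedInputStream page = new BufferedInputStream(url.openStream())) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            response = tidy.parseDOM(page, null);
        }

        XPathFactory factory = XPathFactory.newInstance();
        XPath xPath = factory.newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);
        if (nodes.getLength() == 0) {
            return; // no image matched the pattern
        }

        // Resolve a possibly relative src against the page URL.
        String imageURL = new URL(url, nodes.item(0).getNodeValue()).toString();
        // saveImageNIO appends ".jpg", so pass the file name without its extension.
        String fileName = imageURL.substring(imageURL.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        String imageName = dot > 0 ? fileName.substring(0, dot) : fileName;
        saveImageNIO(imageURL, targetDir, imageName);
    }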

where

    IMAGE_PATTERN = "//a/img/@src";

but the pattern depends on how the images are embedded in the page's HTML code.
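To illustrate how much the result depends on the pattern, here is a minimal sketch that runs two XPath expressions against a small, well-formed snippet using only the JDK's XML parser. The markup in SAMPLE and the class name XPathPatternDemo are made up for illustration; a real page would normally go through JTidy first, since raw HTML is rarely well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPatternDemo {
    // Hypothetical, well-formed markup standing in for a tidied product page.
    static final String SAMPLE =
        "<html><body>"
        + "<a href=\"/p/1\"><img src=\"/Images/Products/Large/1.jpg\"/></a>"
        + "<div><img src=\"/Images/banner.jpg\"/></div>"
        + "</body></html>";

    // Returns the value of every attribute node matched by the expression.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Parsing or XPath errors are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Matches only images that are direct children of a link.
        System.out.println(select(SAMPLE, "//a/img/@src"));
        // Matches every image in the document.
        System.out.println(select(SAMPLE, "//img/@src"));
    }
}
```

With this sample, the first pattern skips the banner image because it is not inside a link, which is the kind of difference to look for when inspecting the supplier's page source.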

Method to save the image using NIO:

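    import java.io.*;
    import java.net.URL;
    import java.nio.channels.*;

    public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
        URL url = new URL(imageURL);
        // try-with-resources closes both the download channel and the output file.
        try (ReadableByteChannel rbc = Channels.newChannel(url.openStream());
             FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg")) {
            fos.getChannel().transferFrom(rbc, 0, 1 << 24); // copies at most 16 MiB
        }
    }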
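For the _A, _B variants mentioned in the question, one possible approach is to build the candidate URLs for a product ID and ask the server whether each one exists with a HEAD request. This is only a sketch under assumptions: the http:// scheme, the position of the suffix directly before ".jpg", the suffix list, and the product ID "12345" are all guesses or placeholders, and the class name ImageVariants is hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class ImageVariants {
    // Path taken from the question; the http:// scheme is an assumption.
    static final String BASE = "http://www.flamingo.be/Images/Products/Large/";
    // Assumed suffixes; extend the list if the supplier uses more variants.
    static final String[] SUFFIXES = {"", "_A", "_B", "_C"};

    // Builds one candidate URL per suffix for the given product ID.
    static List<String> candidateUrls(String productId) {
        List<String> urls = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            urls.add(BASE + productId + suffix + ".jpg");
        }
        return urls;
    }

    // Sends a HEAD request and reports whether the server answers 200 OK.
    static boolean exists(String imageUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            try {
                return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return false; // unreachable host or malformed URL: treat as missing
        }
    }

    public static void main(String[] args) {
        // "12345" is a placeholder product ID.
        String productId = args.length > 0 ? args[0] : "12345";
        for (String url : candidateUrls(productId)) {
            System.out.println(url + " -> " + (exists(url) ? "found" : "missing"));
        }
    }
}
```

Each URL reported as found could then be passed to saveImageNIO. Some servers reject HEAD requests or return 200 with a placeholder image, so the check may need adjusting for the actual site.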

Getting information from a web page in Java
