Question

I am using Cheerio to extract information from the HTML of different web pages. However, one of the sites has the text I want to extract inside a script tag, so Cheerio's methods cannot reach that piece of code.

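While looking for a solution, I found online that the script can be run with puppeteer, a Node API that drives a Chrome instance. Using this approach (which may not be the best one, since I only discovered it a few days ago), I finally obtained the HTML I needed. Unfortunately, I still cannot extract the information I want. This is the HTML I want to extract data from: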

<h2 class="property-price">
  <a href="blablabla">
    <strong>
      <font style="vertical-align: inherit;">
        <font style="vertical-align: inherit;">Text that I wanna extract</font>
      </font>
      <small></small>
    </strong>
  </a>
</h2>

This is the code I am using to try to extract the text, so far without success:

var cheerio = require("cheerio");
const puppeteer = require('puppeteer');
var $;
const POST_LINK_SELECTOR = 'div.property-title';

(async() => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('myUrl',{ timeout: 0 });

    $=cheerio.load(renderedContent);
    console.log($('h2.property-price').find('font').children().text());

    await browser.close();
})();

I'm sure this is not the best way to get the text I need, so any suggestions are welcome. I would also like to know whether I can extract what I need directly with the puppeteer API, or whether I need to use Cheerio (as I did here, although it does not work anyway). Thanks.

Answer 1

You can find the data you need by manipulating the DOM with the help of the page.evaluate method:

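const puppeteer = require('puppeteer');

(async() => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    // Wait until network activity has stopped so the script-rendered markup exists
    await page.goto('myUrl',{waitUntil: "networkidle0"});

    // Runs inside the page: read the text of the link in the price heading
    const text = await page.evaluate(() => document.querySelector("h2.property-price a").textContent.trim());
    console.log(text);

    await browser.close();
})();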
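
As a possible variation on direct extraction with the puppeteer API, the lookup can be wrapped in a small helper that first waits for the element and then reads it with page.$eval. This is only a sketch: extractText is a hypothetical name, and it assumes page is a puppeteer Page that has already navigated to the target URL.

```javascript
// Sketch only: extractText is a hypothetical helper, not part of puppeteer.
// `page` is assumed to be a puppeteer Page that has already navigated.
async function extractText(page, selector) {
    // Wait until the script-rendered element is present in the DOM
    await page.waitForSelector(selector);
    // $eval runs the callback in the browser against the first matching element
    return page.$eval(selector, (el) => el.textContent.trim());
}
```

Inside the async function above it could be called as `const text = await extractText(page, 'h2.property-price a');` right after page.goto.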

If you would like to keep using Cheerio's jQuery-like syntax, that is also possible: just add jQuery to the page (if the site is not already using it):

await page.goto(...);
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
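
Once jQuery has been injected, it can be used from inside page.evaluate. The following is a minimal sketch of how that might look, not a definitive implementation: extractTextWithJQuery is a hypothetical helper name, the selector is taken from the HTML in the question, and page is assumed to be a puppeteer Page that has already navigated.

```javascript
// Same jQuery build as the addScriptTag call above; any reachable build should do.
const JQUERY_URL = 'https://code.jquery.com/jquery-3.2.1.min.js';

// Sketch only: extractTextWithJQuery is a hypothetical helper, not part of puppeteer.
async function extractTextWithJQuery(page, selector) {
    // Inject jQuery into the page so it is available as window.jQuery
    await page.addScriptTag({ url: JQUERY_URL });
    // The callback runs in the browser, so the selector is passed as an argument
    return page.evaluate((sel) => window.jQuery(sel).text().trim(), selector);
}
```

For example, `console.log(await extractTextWithJQuery(page, 'h2.property-price a'));` after page.goto. The callback uses window.jQuery rather than $ in case the page defines its own $.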

Extracting text from a font tag on nodeJs
