Question

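I have been experimenting with web scraping and wanted to try doing it in Node JS. I have some experience scraping in Python with the requests module and BeautifulSoup4, and I wanted to recreate that code in Node JS. However, after essentially mirroring my code (apart from a few changes to account for syntax differences), I cannot find the HTML tag I need. I am using JSsoup with Node JS because it is the closest thing to BeautifulSoup I could find. Here is my code so far: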

const request = require('request');
var jssoup = require('jssoup').default;

const options = {
  url: 'https://kith.com/collections/footwear/products/nkaj7292-002.xml',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
  }
};
function getVariant(error, response, body) {
  if (!error && response.statusCode == 200) {
      var soup = new jssoup(body);
      var nametag = soup.find('title');
      var product = nametag.text;
      console.log(product);
      var sizetag = soup.find('title', { string:'9' });
      console.log(sizetag);
  }
}
request(options, getVariant);

The code correctly finds one tag (<title> Nike Zoom Vomero 5/ACW (Black/Reflect Silver/Anthracite) AT3152-001 </title>), but returns undefined for the second. For reference, this is the tag it is trying to find: <title>9</title>

I have also tried using = instead of a dictionary, and using contents and name instead of string, but so far no luck. What am I doing wrong here?

I also tried looking at the JSsoup documentation, but it does not say much about find().

Answer 1

As you can see in the source, .find expects any string to match as its third argument, so:

let sizetag = soup.find('title', undefined, '9');
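
To illustrate why the object form from the question fails, here is a minimal sketch. It assumes a positional find(name, attrs, string) signature as described above; the find function and the tags array below are hypothetical stand-ins written for this example, not JSsoup's actual implementation. JavaScript has no keyword arguments, so an object passed second is read as an attribute filter.

```javascript
// Hypothetical stand-in for a positional find(name, attrs, string) signature.
// This is NOT JSsoup's code; it only mimics the parameter order.
const tags = [
  { name: 'title', attrs: {}, text: 'Nike Zoom Vomero 5/ACW' },
  { name: 'title', attrs: {}, text: '9' },
];

function find(name = undefined, attrs = undefined, string = undefined) {
  return tags.find((tag) => {
    if (name !== undefined && tag.name !== name) return false;
    if (attrs !== undefined) {
      // Every key in attrs must match an attribute on the tag.
      for (const [key, value] of Object.entries(attrs)) {
        if (tag.attrs[key] !== value) return false;
      }
    }
    if (string !== undefined && tag.text !== string) return false;
    return true;
  });
}

// A Python-style "keyword" object lands in the attrs slot, so it is treated
// as a filter on an attribute named "string", which no tag has.
const wrong = find('title', { string: '9' });

// Passing undefined for attrs puts '9' in the third (string) slot.
const right = find('title', undefined, '9');

console.log(wrong);               // undefined
console.log(right && right.text); // 9
```

The same reasoning would explain the undefined result in the question: the second argument is consumed as attributes, and the string filter is never set.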

I agree with Scott Sauyet that opening an issue might be a good idea, especially to get the documentation fixed.

Answer 2

To get the innerText of an element returned by soup.find, use:

<targetElement>.contents[0]._text

I was also trying to scrape HTML with JSsoup in Node JS and found that it returns an object:

SoupTag {
  name: 'time',                           // name is the tag name
  contents: [ SoupString {                // contents is an array
      parent: [Circular *2],
      previousElement: [Circular *2],
      nextElement: [SoupTag],
      _text: '22 hours ago'               // the inner text lives here
    }],
    }],
  attrs: { class: 'post-last-modified-td' },
  hidden: false,
  builder: TreeBuilder {
    EMPTY_ELEMENT_TAGS: Set(24) {...} 
  }
}

Here is my code:

const find_time = soup.find("time", "post-last-modified-td");
if (find_time != undefined) console.log("Updated", find_time.contents[0]._text);

It prints:

Updated 22 hours ago
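
Because contents[0]._text reaches into an underscore-prefixed field, it may be worth guarding the access. The sketch below uses a mock object that copies only the fields visible in the dump above; firstText is a hypothetical helper name introduced for this example, not part of JSsoup. Note that the code in the question reads nametag.text directly on the tag, which may be a less fragile alternative (not verified here).

```javascript
// Mock object copying only the fields visible in the dump above.
// A real SoupTag has more properties; this is an assumption for illustration.
const mockTag = {
  name: 'time',
  attrs: { class: 'post-last-modified-td' },
  contents: [{ _text: '22 hours ago' }],
};

// Hypothetical helper: returns the first child's text, or undefined when the
// tag is missing or has no children.
function firstText(tag) {
  if (!tag || !Array.isArray(tag.contents) || tag.contents.length === 0) {
    return undefined;
  }
  return tag.contents[0]._text;
}

console.log('Updated', firstText(mockTag)); // Updated 22 hours ago
console.log(firstText(undefined));          // undefined
```

With a guard like this, a failed find (which yields undefined) does not throw when the text is read.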
