Extract text from HTML elements and create an object

Date: 2018-08-26 16:35:48

Tags: javascript node.js regex

I am trying to work out the following code, for which I think I need a regular expression.

This is the text I saved to a variable after fetching it from a website.

[ '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Name: </font><a href="site.php?page=send&sendto=Username"><font color="#999999">Username</font></a>&nbsp;&nbsp;&nbsp;</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Crew: </font><a href="site.php?page=crewprofile&id=2120"><font color="#999999">My Crew</font></a> </td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Wealth: Rich</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Rank: Hitman</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font><font color=green>Online</font><font color="#999999"> )</font><tr><td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages received: 1</font></td>' ]

This text may contain more or fewer tags, because they are fetched from a website where every "profile" is different.

I would like it to return:

Name: Username   
Crew: My Crew   
Wealth: Rich   
Rank: Hitman
Status: Alive ( Online )
Messages sent: 3
Messages received: 1

Any help is appreciated! Thanks

1 answer:

Answer 0 (score: 1)

You can use a DocumentFragment to extract the desired data from the <td> elements.
For Node, take a look at a helper such as jsdom (available on npm).

const td = [ '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Name: </font><a href="site.php?page=send&sendto=Username"><font color="#999999">Username</font></a>&nbsp;&nbsp;&nbsp;</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Crew: </font><a href="site.php?page=crewprofile&id=2120"><font color="#999999">My Crew</font></a> </td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Wealth: Rich</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Rank: Hitman</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font><font color=green>Online</font><font color="#999999"> )</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages received: 1</font></td>' ];

const tr = document.createElement("tr");
const table = document.createElement("table");
const frag = document.createDocumentFragment(); // Minimal Document wrapper

tr.innerHTML = td.join("");
table.appendChild(tr);
frag.appendChild(table);

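// Build one object whose keys are the labels before the first colon.
const data = [...frag.querySelectorAll("td")].reduce((ob, td) => {
  const a = td.textContent.split(':');
  // Re-join the remaining parts so values that contain ":" stay intact.
  ob[a[0].trim()] = a.slice(1).join(":").trim();
  return ob;
}, {});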

console.log( data );

PS:

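Note: in your array you have </font><tr><td where it should be </font></td>', '<td. I fixed that above (I did not strictly have to, since it still parses correctly). Even so, first make sure that you at least get an array of well-formed HTML.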

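It is exactly this kind of thing that makes parsing HTML with regular expressions a bad idea. Even with the error above, the HTML parser still produces a correct(-ish) result, whereas extracting the content strictly with a regexp would fail completely.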
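
To illustrate the point, here is a minimal sketch of what a regex-only approach might do with the malformed element from the question. The helper naiveText is a hypothetical name, and the tag-stripping pattern is only an assumption about how such an attempt could look, not code from the question.

```javascript
// The malformed element from the question: "</font><tr><td" appears
// where "</font></td>" followed by a new array item was expected.
const malformed =
  '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font>' +
  '<font color=green>Online</font><font color="#999999"> )</font>' +
  '<tr><td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>';

// Hypothetical naive approach: drop every tag and turn &nbsp; into a space.
function naiveText(html) {
  return html.replace(/<[^>]+>/g, "").replace(/&nbsp;/g, " ").trim();
}

const text = naiveText(malformed);

// Same "split at the first colon" logic as in the answer.
const [key, ...rest] = text.split(":");
const naiveResult = { [key.trim()]: rest.join(":").trim() };

console.log(naiveResult);
```

Because the regex knows nothing about cell boundaries, the "Messages sent" field is folded into the value of "Status" instead of becoming its own key.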


Using jsdom for Node, your code should look something like this:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const td = ['<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Name: </font><a href="site.php?page=send&sendto=Username"><font color="#999999">Username</font></a>&nbsp;&nbsp;&nbsp;</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Crew: </font><a href="site.php?page=crewprofile&id=2120"><font color="#999999">My Crew</font></a> </td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Wealth: Rich</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Rank: Hitman</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font><font color=green>Online</font><font color="#999999"> )</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages received: 1</font></td>'];

const dom = new JSDOM(`<table><tr>${td.join("")}</tr></table>`);
const frag = dom.window.document;

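// Build one object whose keys are the labels before the first colon.
const data = [...frag.querySelectorAll("td")].reduce((ob, td) => {
    const a = td.textContent.split(':');
    // Re-join the remaining parts so values that contain ":" stay intact.
    ob[a[0].trim()] = a.slice(1).join(":").trim();
    return ob;
}, {});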

console.log( data );
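
The key/value split inside the reducer can also be looked at on its own, without a DOM. The sketch below is only an illustration: splitField is a hypothetical helper name, and the sample strings (including the "Last seen" field) are made up to stand in for td.textContent values.

```javascript
// Hypothetical helper: split "Key: value" at the first colon only,
// so that a value which itself contains colons is kept whole.
function splitField(text) {
  const parts = text.split(":");
  return [parts[0].trim(), parts.slice(1).join(":").trim()];
}

// Made-up sample strings standing in for td.textContent values.
// In textContent, &nbsp; shows up as U+00A0, which trim() also removes.
const samples = [
  "\u00a0Name: Username\u00a0\u00a0\u00a0",
  "\u00a0Last seen: 16:35:48", // hypothetical field with colons in the value
];

const fields = Object.fromEntries(samples.map(splitField));
console.log(fields);
```

Splitting on every colon and re-joining everything after the first part is what keeps a time-like value such as 16:35:48 from being truncated.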