Question

I have an RSS (XML) file and want to convert it to a JSON file containing human-readable text (no formatting). (Maybe "sanitize" isn't the right search term?)

A sample of the XML looks like this:

<description>&lt;p&gt;&lt;strong&gt;&lt;img alt=&quot;&quot;
 src=&quot;/site/sites/default/files/ReligionUN.png&quot; 
 style=&quot;width: 43px; height: 34px; float: left;&quot; 
 /&gt;June 20&lt;/strong&gt;&lt;br /&gt;
 &amp;nbsp;&lt;/p&gt;
  &lt;p&gt;The UN World Refugee Day was agreed upon in 2001 in
 connection with the celebration of the Refugee Convention&amp;
 #39;s fiftieth anniversary. The date was chosen because the 
 Organization of African Unity already celebrated Africa Refugee 
 Day on June 20.&amp;nbsp;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;
 &lt;p&gt;The Holiday Calendar is sponsered by:&lt;/p&gt;
 &lt;p&gt;&lt;img alt=&quot;&quot; p=&quot;&quot;
 src=&quot;/site/sites/default/files/alle_logoer_800x600.png&quot;
 style=&quot;width: 800px; height: 600px;&quot; /&gt;&lt;/p&gt;
</description>

What I want to achieve is the following:

"description": "December 18\nThe UN International Migrants' Day
  marks the adoption of the International Migrant Workers Convention
  on December 18, 1990.\nThe UN wished to emphasize that
  transnational migration is a growing phenomenon, which can
  contribute to growth and development across the world provided that
  the international community assure migrants' rights.\n\nThe Holiday 
  Calendar is sponsered by:\n"

I need to clean up the text in either the XML or the JSON (preferably the former). With the following code:

const fs = require('fs')
const convert = require('xml-js')
const _ = require('lodash')
const striptags = require('striptags')

const xmlstr = fs.readFileSync('./english.xml', 'utf8')

const json_html = convert.xml2json(xmlstr, { compact: true, spaces: 4 })

const json_stripped = striptags(
  _.replace(json_html, new RegExp('&nbsp;', 'g'), '')
)

fs.writeFileSync('./english.json', json_stripped)

I have gotten this far:

"description": "December 18\n\nThe UN International Migrants&#39; 
  Day marks the adoption of the International Migrant Workers
  Convention on December 18, 1990.\nThe UN wished to emphasize that
  transnational migration is a growing phenomenon, which can
  contribute to growth and development across the world provided that
  the international community assure migrants&#39; rights.\nThe
  Holiday Calendar is sponsered by:\n\n\n\n\n\n\n\n"

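It's almost there, but as you can see, I'm still struggling to find out how to replace things like &nbsp; and &#39;, and how to collapse multiple newlines into a single newline.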

Answer 1

You want to unescape/decode the HTML. There are a bunch of packages for that.

Like this one:

console.log(entities.decode('&lt;&gt;&quot;&apos;&amp;&copy;&reg;&#8710;')); // <>"'&&copy;&reg;∆
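To illustrate what "decoding" means here, the following is a minimal, dependency-free sketch. `decodeEntities` and the `NAMED` table are hypothetical names made up for this example and are not part of any package; the table covers only a handful of entities, so a maintained decoding package is the better choice for real feeds.

```javascript
// Hypothetical helper for illustration only (not a library API).
// Decodes a small set of named entities plus decimal and hex numeric references.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeEntities('migrants&#39; rights'));
```

Each call decodes exactly one level of escaping, so a double-escaped sequence such as `&amp;lt;` becomes `&lt;` rather than `<`. That matters for the feed in the question, where the HTML inside `<description>` is escaped once for XML and some entities are escaped again.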

Answer 2

The following code does the job:

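const fs = require('fs')
const convert = require('xml-js')
// AllHtmlEntities is the class-based API of html-entities 1.x
const Entities = require('html-entities').AllHtmlEntities
const striptags = require('striptags')

const xmlstr = fs.readFileSync('./rss.xml', 'utf8')

// Turn double-escaped quotes into apostrophes before converting, so that
// decoding later cannot introduce raw double quotes into the JSON string
const json_html = convert.xml2json(xmlstr.replace(/&amp;quot;/g, "'"), { compact: true, spaces: 4 })

const entities = new Entities()

// Decode the remaining HTML entities, then strip the HTML tags
const json_stripped = striptags(entities.decode(json_html))

fs.writeFileSync('./rss.json', json_stripped)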
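The question also asks how to collapse runs of newlines, which the code above does not address. A rough sketch of that step follows, assuming the entities have already been decoded and that you apply it to each description string rather than to the whole JSON text. `toPlainText` is a hypothetical helper name, and the regex-based tag removal is deliberately naive (it would be confused by a `>` inside an attribute value), so it is not a replacement for `striptags`.

```javascript
// Hypothetical helper for illustration: turns decoded HTML into plain text
// with at most one newline between blocks.
function toPlainText(html) {
  return html
    .replace(/\s+/g, ' ')                 // source line wraps and non-breaking spaces -> single space
    .replace(/<br\s*\/?>|<\/p>/gi, '\n')  // treat <br> and </p> as line breaks
    .replace(/<[^>]*>/g, '')              // drop all remaining tags (naive)
    .replace(/ *\n[ \n]*/g, '\n')         // collapse newline runs and the spaces around them
    .trim();
}

const sample =
  '<p><strong><img alt="" src="/x.png" />June 20</strong><br />\n \u00a0</p>\n' +
  ' <p>The UN World Refugee Day.\u00a0</p><p>\u00a0</p>\n<p>Sponsored by:</p>';

console.log(JSON.stringify(toPlainText(sample)));
```

Working on the individual strings instead of the serialized JSON also avoids the risk that a decoded quote or a stripped fragment corrupts the JSON syntax; you can serialize with `JSON.stringify` after cleaning.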

"Sanitizing" RSS into human-readable text
