"Sanitize" RSS into human-readable text

Time: 2017-07-01 11:25:07

Tags: javascript json node.js xml

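I have an RSS (XML) file and would like to convert it into a JSON file containing human-readable text (without formatting). (Maybe "sanitize" is not the right search term?)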

A sample of the XML looks like this:

<description>&lt;p&gt;&lt;strong&gt;&lt;img alt=&quot;&quot;
 src=&quot;/site/sites/default/files/ReligionUN.png&quot; 
 style=&quot;width: 43px; height: 34px; float: left;&quot; 
 /&gt;June 20&lt;/strong&gt;&lt;br /&gt;
 &amp;nbsp;&lt;/p&gt;
  &lt;p&gt;The UN World Refugee Day was agreed upon in 2001 in
 connection with the celebration of the Refugee Convention&amp;
 #39;s fiftieth anniversary. The date was chosen because the 
 Organization of African Unity already celebrated Africa Refugee 
 Day on June 20.&amp;nbsp;&lt;/p&gt;&lt;p&gt;&amp;nbsp;&lt;/p&gt;
 &lt;p&gt;The Holiday Calendar is sponsered by:&lt;/p&gt;
 &lt;p&gt;&lt;img alt=&quot;&quot; p=&quot;&quot;
 src=&quot;/site/sites/default/files/alle_logoer_800x600.png&quot;
 style=&quot;width: 800px; height: 600px;&quot; /&gt;&lt;/p&gt;
</description>

What I want to achieve is the following:

"description": "December 18\nThe UN International Migrants' Day
  marks the adoption of the International Migrant Workers Convention
  on December 18, 1990.\nThe UN wished to emphasize that
  transnational migration is a growing phenomenon, which can
  contribute to growth and development across the world provided that
  the international community assure migrants' rights.\n\nThe Holiday 
  Calendar is sponsered by:\n"

I need to clean up the text either in the XML or in the JSON (I would prefer the former). Using the following code:

const fs = require('fs')
const convert = require('xml-js')
const _ = require('lodash')
const striptags = require('striptags')

const xmlstr = fs.readFileSync('./english.xml', 'utf8')

const json_html = convert.xml2json(xmlstr, { compact: true, spaces: 4 })

const json_stripped = striptags(
  _.replace(json_html, new RegExp('&nbsp;', 'g'), '')
)

fs.writeFileSync('./english.json', json_stripped)

this is how far I have got:

"description": "December 18\n\nThe UN International Migrants&#39; 
  Day marks the adoption of the International Migrant Workers
  Convention on December 18, 1990.\nThe UN wished to emphasize that
  transnational migration is a growing phenomenon, which can
  contribute to growth and development across the world provided that
  the international community assure migrants&#39; rights.\nThe
  Holiday Calendar is sponsered by:\n\n\n\n\n\n\n\n"

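It is almost there, but as you can see, I still cannot figure out how to replace things like &nbsp; and &#39;, or how to collapse multiple newlines into a single newline.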

2 answers:

Answer 0: (score: 1)

You want to unescape (decode) the HTML entities. There are a bunch of packages for that, for example this one:

// Assumes `entities` is an XmlEntities instance from the html-entities package (1.x API)
console.log(entities.decode('&lt;&gt;&quot;&apos;&amp;&copy;&reg;&#8710;')); // <>"'&&copy;&reg;∆
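
To make clearer what such a decoder does with the entities from the question, here is a minimal dependency-free sketch. The helper name `decodeBasicEntities` and the `NAMED_ENTITIES` table are made up for this illustration and only cover a handful of names; mapping `&nbsp;` to a plain space (rather than U+00A0) is a simplification I chose for readable output. A real package handles far more cases.

```javascript
// Hypothetical helper for illustration; it is not part of any library.
// Only the entities listed here are decoded; unknown names are left untouched.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

function decodeBasicEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      // Numeric reference: hexadecimal (&#x27;) or decimal (&#39;)
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Named reference: look it up, or keep the original text if unknown
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

console.log(decodeBasicEntities('migrants&#39; rights&nbsp;&amp; more')); // migrants' rights & more
```

Because the replacement is done in a single pass, a double-escaped sequence such as `&amp;lt;` becomes `&lt;` and is not decoded a second time.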

Answer 1: (score: 0)

The following code did the job:

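const fs = require('fs')
const convert = require('xml-js')
// AllHtmlEntities is the class-based API of html-entities 1.x
const Entities = require('html-entities').AllHtmlEntities
const striptags = require('striptags')

const xmlstr = fs.readFileSync('./rss.xml', 'utf8')

// Turn double-escaped quotes into apostrophes before converting, presumably so that
// decoding them later does not put a bare " inside a JSON string
const json_html = convert.xml2json(xmlstr.replace(/&amp;quot;/g, "'"), { compact: true, spaces: 4 })

const entities = new Entities()

// Decode the remaining entities (&nbsp;, &#39;, ...) and then strip the HTML tags
const json_stripped = striptags(entities.decode(json_html))

fs.writeFileSync('./rss.json', json_stripped)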
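
The question also asked about shrinking runs of newlines into a single newline. A minimal sketch of one way to do that follows; `collapseNewlines` is a hypothetical helper name, and the sample JSON is made up for the demonstration. Note that it works on the parsed string value, not on the serialized JSON text, because inside a JSON string a newline is stored as the two characters `\` and `n`.

```javascript
// Hypothetical helper for illustration; it is not part of xml-js or striptags.
function collapseNewlines(text) {
  // \s also matches the non-breaking space (U+00A0) that decoding &nbsp; produces,
  // so lines holding only whitespace disappear together with the extra newlines.
  return text.replace(/\s*\n\s*/g, '\n').trim();
}

// Parse first, clean the value, then serialize again.
const item = JSON.parse('{"description": "December 18\\n\\n\\u00a0\\nThe Holiday Calendar\\n\\n\\n"}');
item.description = collapseNewlines(item.description);
console.log(JSON.stringify(item));
```

The final `trim()` also drops the trailing newline; leave it out if the trailing `\n` shown in the desired output should be kept.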