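I have an RSS (XML) file and want to convert it to a JSON file containing human-readable text with no formatting. (Maybe "sanitize" is not the right search term?)
An example of the XML looks like this: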
<description><p><strong><img alt=""
src="/site/sites/default/files/ReligionUN.png"
style="width: 43px; height: 34px; float: left;"
/>June 20</strong><br />
&nbsp;</p>
<p>The UN World Refugee Day was agreed upon in 2001 in
connection with the celebration of the Refugee Convention&
#39;s fiftieth anniversary. The date was chosen because the
Organization of African Unity already celebrated Africa Refugee
Day on June 20.&nbsp;</p><p>&nbsp;</p>
<p>The Holiday Calendar is sponsered by:</p>
<p><img alt="" p=""
src="/site/sites/default/files/alle_logoer_800x600.png"
style="width: 800px; height: 600px;" /></p>
</description>
What I want to achieve is the following:
"description": "December 18\nThe UN International Migrants' Day
marks the adoption of the International Migrant Workers Convention
on December 18, 1990.\nThe UN wished to emphasize that
transnational migration is a growing phenomenon, which can
contribute to growth and development across the world provided that
the international community assure migrants' rights.\n\nThe Holiday
Calendar is sponsered by:\n"
I need to clean up the text either in the XML or in the JSON (I would prefer the former). Using the following code:
const fs = require('fs')
const convert = require('xml-js')
const _ = require('lodash')
const striptags = require('striptags')
const xmlstr = fs.readFileSync('./english.xml', 'utf8')
const json_html = convert.xml2json(xmlstr, { compact: true, spaces: 4 })
const json_stripped = striptags(
_.replace(json_html, new RegExp('&nbsp;', 'g'), '')
)
fs.writeFileSync('./english.json', json_stripped)
this is what I have so far:
"description": "December 18\n\nThe UN International Migrants'
Day marks the adoption of the International Migrant Workers
Convention on December 18, 1990.\nThe UN wished to emphasize that
transnational migration is a growing phenomenon, which can
contribute to growth and development across the world provided that
the international community assure migrants' rights.\nThe
Holiday Calendar is sponsered by:\n\n\n\n\n\n\n\n"
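It is almost there, but as you can see, I am still struggling to work out how to replace HTML entities such as &#39; and how to collapse multiple newlines into a single newline.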
Answer 0 (score: 1)
You want to unescape/decode the HTML. There are a bunch of packages for that.
console.log(entities.decode('&lt;&gt;&quot;&apos;&amp;&copy;&reg;&#8710;')); // <>"'&©®∆
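If adding a package is not an option, the decoding step can be roughly approximated in plain JavaScript. The following is only a minimal sketch: `decodeEntities` and the `NAMED` table are hypothetical names introduced here, the table covers just a handful of named entities, and a real decoder package knows far more of them.

```javascript
// Hypothetical, deliberately incomplete table of named entities
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' }

// Hypothetical helper: decodes numeric references (&#39; / &#x27;) and the
// few named entities listed above in a single pass; anything else is left as-is
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X'
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10)
      // Keep out-of-range references untouched instead of throwing
      return code <= 0x10ffff ? String.fromCodePoint(code) : match
    }
    // Entity names are case-sensitive, so look the name up exactly as written
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match
  })
}

console.log(decodeEntities('Refugee Convention&#39;s fiftieth anniversary'))
```

Because the replacement happens in one pass, a double-escaped sequence such as `&amp;lt;` is decoded only one level, to `&lt;`.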
Answer 1 (score: 0)
The following code did the job:
const fs = require('fs')
const convert = require('xml-js')
const Entities = require('html-entities').AllHtmlEntities
const striptags = require('striptags')
const xmlstr = fs.readFileSync('./rss.xml', 'utf8')
const json_html = convert.xml2json(xmlstr.replace(/&quot;/g, "'"), { compact: true, spaces: 4 })
const entities = new Entities()
const json_stripped = striptags(entities.decode(json_html))
fs.writeFileSync('./rss.json', json_stripped)
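The question also asks about collapsing repeated newlines, which the code above does not do. In addition, decoding entities inside the already serialized JSON string can produce invalid JSON whenever a decoded character is a double quote, which is presumably why `&quot;` is replaced beforehand. One possible alternative, sketched here under assumptions, is to clean each string value of the parsed object and serialize afterwards. `tidyText` and `mapStrings` are hypothetical helper names, and the `parsed` object is a hand-written stand-in for what xml-js returns in compact mode.

```javascript
// Hypothetical helper: turn non-breaking spaces into plain spaces, drop
// whitespace around line breaks, and shrink runs of newlines to a single one
function tidyText(text) {
  return text
    .replace(/\u00a0/g, ' ')
    .replace(/[ \t]*\r?\n[ \t\r\n]*/g, '\n')
    .replace(/[ \t]{2,}/g, ' ')
    .trim()
}

// Hypothetical helper: apply `fn` to every string value in a parsed structure
function mapStrings(value, fn) {
  if (typeof value === 'string') return fn(value)
  if (Array.isArray(value)) return value.map(item => mapStrings(item, fn))
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, item]) => [key, mapStrings(item, fn)])
    )
  }
  return value
}

// Stand-in for the object produced by convert.xml2js(xmlstr, { compact: true })
const parsed = {
  description: { _text: 'December 18\n\n\u00a0\nThe Holiday Calendar is sponsered by:\n\n\n\n' }
}

const cleaned = mapStrings(parsed, tidyText)
console.log(JSON.stringify(cleaned, null, 4))
```

With the packages used above, the callback could be something like `s => tidyText(entities.decode(striptags(s)))`. Stripping tags before decoding keeps text such as `&lt;b&gt;` from being mistaken for markup. Since `JSON.stringify` escapes quotes itself, the `&quot;` replacement would no longer be needed in that variant.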