Question

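My Chrome bookmarks are a mess, so I exported them and decided to write a Python program to clean them up, for example by sorting them by keyword.

I found Beautiful Soup. The problem is that the export uses the Netscape bookmark file format rather than standard XML. Beautiful Soup tries to convert it to well-formed XHTML, and Chrome cannot read the result.

Is there another solution?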

Answer 1

By default, Chrome stores your bookmarks as JSON, for example at:

C:\Users\user\AppData\Local\Google\Chrome\User Data\Default\Bookmarks

For Linux users:

~/.config/google-chrome/Default/Bookmarks

(The location of this file varies by platform.)

You may find this file easier to work with than the HTML export.
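As a rough illustration of why the JSON file can be easier to handle, here is a minimal Python sketch that lists every link in it. It assumes the layout commonly seen in Chrome profiles: a top-level `roots` object whose values are folder nodes, where each node has a `type` and a `name`, plus a `url` (for links) or `children` (for folders). The helper names `iter_links` and `load_links` are hypothetical, and the key names should be checked against your own file.

```python
import json
from pathlib import Path


def iter_links(node, folder_path=()):
    """Yield (folder_path, name, url) for every link at or below `node`."""
    if node.get("type") == "url":
        yield folder_path, node.get("name", ""), node.get("url", "")
    # Folders carry their entries in "children" (assumed key name).
    for child in node.get("children", []):
        yield from iter_links(child, folder_path + (node.get("name", ""),))


def load_links(bookmarks_file):
    """Read a Chrome-style Bookmarks JSON file and return all links as a list."""
    data = json.loads(Path(bookmarks_file).read_text(encoding="utf-8"))
    links = []
    for root in data.get("roots", {}).values():
        # Skip any non-folder entries that may sit next to the root folders.
        if isinstance(root, dict):
            links.extend(iter_links(root))
    return links
```

Calling `load_links` with the path to a copy of the Bookmarks file returns tuples that can be sorted or filtered by keyword. The sketch only reads the file and never writes to it.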

Answer 2

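I have the same problem. I am currently writing a Python bookmark toolkit just to clean up the messy bookmarks in Chrome.

The project, bookmarkit, is on GitHub: https://github.com/allengaller/bookmarkit

I don't think locating the bookmark file that Chrome uses will help you, unless you parse that JSON file into a dict. (I see you opened another question about that, so I assume you have already set the SGML bookmark file aside.)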

My approach is:

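Managing bookmarks through a CLI is a dead end, because it is a steep hurdle for people who need a tool JUST to manage bookmarks (most of them, like me, have bookmark files of 10M+). I will provide a simple drag-and-drop GUI built with PyGTK or PyQt.
About BS changing the file: stop worrying about BS modifying your bookmark file. Every time you finish parsing, generate a new NETSCAPE-BOOKMARK file instead of reusing the original one (even if nothing changed).
Try the ElementTree library. See here: http://docs.python.org/library/xml.etree.elementtree.html
I think parsing the SGML export is safer than directly changing the JSON file Chrome is using. Heavy users like me care a great deal about our data, so I would rather export carefully, import into my toolkit, do my work, and then import back into Chrome. It is best to keep this process explicit.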
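The parse, clean up, regenerate cycle described above can be sketched in Python as follows. This is only a sketch under stated assumptions: it uses the standard library's `html.parser` rather than ElementTree, because the export is not well-formed XML. It assumes the Chrome-style layout in which a folder's `<H3>` is followed by its `<DL>`, and it drops attributes such as `ADD_DATE` and `ICON`. The names `BookmarkParser`, `sort_tree`, `render`, and `to_netscape` are hypothetical.

```python
import html
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Build a nested folder/link tree from a Netscape bookmark export."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"type": "folder", "title": "", "children": []}
        self.stack = [self.root]  # folders that are currently open
        self.pending = None       # folder created by <H3>, waiting for its <DL>
        self.current = None       # item whose title text is being collected

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.current = {"type": "folder", "title": "", "children": []}
            self.stack[-1]["children"].append(self.current)
            self.pending = self.current
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            self.current = {"type": "link", "title": "", "href": href}
            self.stack[-1]["children"].append(self.current)
        elif tag == "dl" and self.pending is not None:
            self.stack.append(self.pending)
            self.pending = None

    def handle_endtag(self, tag):
        if tag in ("h3", "a"):
            self.current = None
        elif tag == "dl" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.current["title"] += data


def sort_tree(folder):
    """Sort each folder in place: subfolders first, then links, by title."""
    for child in folder["children"]:
        if child["type"] == "folder":
            sort_tree(child)
    folder["children"].sort(key=lambda c: (c["type"] != "folder", c["title"].lower()))


def render(folder, indent=1):
    """Return the <DT> lines for the contents of one folder."""
    pad = "    " * indent
    lines = []
    for child in folder["children"]:
        title = html.escape(child["title"], quote=False)
        if child["type"] == "folder":
            lines.append(f"{pad}<DT><H3>{title}</H3>")
            lines.append(f"{pad}<DL><p>")
            lines.extend(render(child, indent + 1))
            lines.append(f"{pad}</DL><p>")
        else:
            href = html.escape(child["href"], quote=True)
            lines.append(f'{pad}<DT><A HREF="{href}">{title}</A>')
    return lines


HEADER = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>"""


def to_netscape(root):
    """Serialize the tree as a fresh NETSCAPE-BOOKMARK document."""
    return "\n".join([HEADER, *render(root), "</DL><p>", ""])
```

A typical run would feed the exported text to `BookmarkParser().feed(...)`, call `sort_tree` on the parser's `root`, and write the string from `to_netscape` to a new file, leaving the original export untouched so it can serve as a backup.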

Answer 3

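I figured out how to do this with Node.js. Just install cheerio (npm install -S cheerio) and pass the inputFile and outputFile names through environment variables or command-line arguments. Here is my solution: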

const fs = require('fs')
const path = require('path')
const cheerio = require('cheerio')
const inputFile = process.env.INPUT || process.argv[2] || 'bookmarks.html'
const outputFile = process.env.OUTPUT || process.argv[3] || 'bookmarks.json'
const inputFilePath = path.resolve(inputFile)
const outputFilePath = path.resolve(outputFile)

fs.readFile(inputFilePath, { encoding: 'utf8' }, (error, data) => {
  if (error)
    return console.error(error)
  const $ = cheerio.load(data)
  // Convert one <DT>, <H3> (folder) or <A> (link) element into an item appended to `out`.
  function parseTerm(element, out) {
    const item = {}
    if (element.name === 'dt') {
      parseTerm($(element).children(':not(p)').first().get()[0], out)
    } else if (element.name === 'h3') {
      item.title = $(element).text()
      item.type = 'folder'
      item.updated = $(element).attr('last_modified')
      item.children = []
      out.push(item)
      parseList($(element).next(), item.children)
    } else if (element.name === 'a') {
      item.title = $(element).text()
      item.type = 'link'
      item.added = $(element).attr('add_date')
      item.href = $(element).attr('href')
      item.icon = $(element).attr('icon')
      out.push(item)
    }
  }
  // Walk the direct children of a <DL> list, skipping the <p> tags the format contains.
  function parseList(list, out) {
    list.children(':not(p)').each(function (index) {
      parseTerm(this, out)
    })
  }
  const out = []
  parseList($('dl').first(), out)
  fs.writeFile(outputFilePath, JSON.stringify(out, null, 2), error => {
    if (error)
      return console.error(error)
    console.log('Success!')
  })
})

Working with the Netscape bookmark file format in Python?
