从Chrome书签html文件中检索信息

时间:2018-07-09 02:51:18

标签: java regex parsing html-parsing

我想解析包含导出的Google Chrome书签的文件。这是一个.html文件。对于每个书签,我都对URL,ADD_DATE和超链接标记末尾的标题感兴趣。

这是Chrome书签html文件的摘要。

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
 It will be read and overwritten.
 DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><A HREF="https://www.programcreek.com/2011/03/java-write-to-a-file-code-example/" ADD_DATE="1508652899" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABVUlEQVQ4jcWSwYoTURBFz30vDTG8xcTgIhsHYdwFI7jRf/G3/AT/pA0mkF9woSLBTkyMbfKa6a5yIQmIDCIRvLu6VB2qLiV3F/9TcverSwGbvx1y4HS33N0v2aB3F+BkuzsxxrPf3LZkhxQDkhMkIemXIeDsxRi1Wq0o38z4vvvCxzbw6t2RGCUjEMxMXddJkkIIAmRmMjOFELTZbFgul6qqzyrfLnT/9ps+NdLr95mISW3beoyR7XZLXdeMRiMGgwHuTs6Z2WzG8XikKApy0/BgeMWz5y/40IhH9/QzxPV67YvFgq7r6Pf7TCYTxuMx8/mcqqooigJ3J4RAzpnr64dMp09lZq66rq0sSw6HA0VR0LYtIQRSSuz3e3q93m/ZNE3Dk+mUxzc39KqqspQSw+HwHKQkuq4jpXSuTxBJmBlfdztyzv/gD4CXlwAultw9/rntbv0A1ZC8BgHlLSQAAAAASUVORK5CYII=">How to Write a File Line by Line in Java?</A>
    <DT><A HREF="https://stackoverflow.com/questions/2885173/how-do-i-create-a-file-and-write-to-it-in-java" ADD_DATE="1508652914">How do I create a file and write to it in Java? - Stack Overflow</A>
    <DT><A HREF="https://www.javacodegeeks.com/2010/05/getting-started-with-youtube-java-api.html" ADD_DATE="1508996959" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABvUlEQVQ4jZWTP2sUURTFf3dmxcUUCVgFHZwU1m66dBmbLOkmXRy2WBtBDMh+gjCfIEg2hAiCi8tgY1axkKTZEQsTm52tLTISUoo2gQ1m3rXYP4TdjCavu+8ezj2cey5c8wV151ew5bSGdeG6BAoJSjqs5V/gqWC/lFnqKzwoGAlPo6VkHHOpgmDLqQLsflUXWBcgs7QLXI0AZQMktSA0oz+9d6uy5xsoqTB99qZcmyAoVj+5nOMefu8kzu1vDVMwyZ3pTnzycz7kBqk5lyMAUQBqEx7crOy1BPx+ZR6uLDxZF/CitWMZ9DsCpb52M9d7vZwWKpt3PSPSRiXcPaTLiMDyRGhg5PPqtut+/PJyJhNeiDEzKpoUKf7uDRU8qjttC/nw7mAnUaznAl3byPvTaCm5OLXXLE9sLXeNQd05UkhbB68A9QBsI/Pjq8wNkkIqkKIGRACJL8PlKrgYorNmeQUG+VAWo7Xjx0OclUeQWWzQD5E/FeyXBrIWgerqtuv+lwA1o7xndt9EY9uhse25t0/TUS//mEQbID/AxAv3d9zZutMW86cWPTu5mom95nIMxACzm44v4InaHmP38Bf/laoOI/FjiQAAAABJRU5ErkJggg==">Getting Started with YouTube Java API | Java Code Geeks - 2017</A>
</DL><p>

请注意,有些书签具有“ ICON”属性,有些则没有。

我想检索除“ ICON”值以外的所有内容。我的目标是从文件中检索信息并将其存储在数据库中,以便在另一个应用程序中组织和利用数据。

为此,我研究了正则表达式,但是并没有太多的经验来使它完全起作用。我首选的语言是Java,但如果Python可以更好地工作,我可以使用它。

1 个答案:

答案 0 :(得分:1)

尝试正则表达式:<DT><A HREF=\"(.*?)(?=\")\" ADD_DATE=\"(\d+)\".*?>([\s\S]+?)<\/A>

Demo

链接将在第1组中,第2组中的日期,在第3组中的标题