Question

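I am moving my bookmarks from kippt.com to pinboard.in.

I exported my bookmarks from Kippt, and for some reason they store the tags (prefixed with #) and the description in the same field. Pinboard keeps tags and descriptions separate.

This is what a Kippt bookmark looks like after export: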

<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks">This is a title</A>
<DD>#tag1 #tag2 This is a description

This is what it should look like before being imported into Pinboard:

<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks" TAGS="tag1,tag2">This is a title</A>
<DD>This is a description

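Basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it up to the first line, inside the <A> tag.

I have been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?

So far I have not come up with a good recipe. Any insights?

Edit:

Here is an actual example of what the input file looks like (3 entries out of 3500):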

<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks">Phabricator</A>
<DD>#bug #tracking 

<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>

<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland

Answer 1

This may not be the most beautiful solution, but since this looks like a one-time job it should be good enough.

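import re

dt = re.compile(r'^<DT>')
dd = re.compile(r'^<DD>')

# Start empty so a file that begins with a blank line does not raise NameError.
current_dt = ""
current_dd = ""

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if dt.match(line):
            current_dt = line.strip()
        elif dd.match(line):
            words = line[4:].split()
            # Words starting with '#' are tags; everything else is the description.
            tags = [w[1:] for w in words if w.startswith('#')]
            description = ' '.join(w for w in words if not w.startswith('#'))
            if tags:
                # Insert TAGS="..." before the closing '>' of the opening <A ...> tag.
                current_dt = re.sub(r'(<A[^>]+)>',
                                    lambda m: m.group(1) + ' TAGS="' + ','.join(tags) + '">',
                                    current_dt, count=1)
            # Drop the <DD> line entirely when no description is left.
            current_dd = '<DD>' + description if description else ""
        else:
            # Any other line (normally a blank one) ends the entry: print it and reset.
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""

    # Print the last entry in case the file does not end with a blank line.
    print(current_dt)
    print(current_dd)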

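Let me know if any part of the code is unclear. You can of course use Python to write the lines to a file instead of printing them, or even modify the original file in place.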

Edit: added an if clause so that empty <DD> lines do not show up in the result.
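
The answer mentions that the lines could be written to a file instead of printed. A minimal sketch of that variant follows, assuming Python 3 and UTF-8 input. The names `convert_lines` and `convert_file` and the output filename `pinboard.html` are made up for illustration and are not part of the original script. Unlike the script above, this sketch passes every other line (blank lines, headers) through unchanged.

```python
import re

# Opening <A ...> tag up to, but not including, its closing '>'.
A_TAG = re.compile(r'(<A[^>]+)>')


def convert_lines(lines):
    """Return Pinboard-style lines for Kippt-style <DT>/<DD> lines (hypothetical helper)."""
    out = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('<DD>') and out and out[-1].startswith('<DT>'):
            words = line[4:].split()
            # '#word' tokens become tags; the remaining words form the description.
            tags = [w[1:] for w in words if w.startswith('#') and len(w) > 1]
            description = ' '.join(w for w in words if not w.startswith('#'))
            if tags:
                # Add TAGS="..." before the closing '>' of the opening <A ...> tag.
                out[-1] = A_TAG.sub(
                    lambda m: m.group(1) + ' TAGS="' + ','.join(tags) + '">',
                    out[-1], count=1)
            if description:
                out.append('<DD>' + description)
        else:
            out.append(line)
    return out


def convert_file(src, dst):
    """Read src, convert it, and write the result to dst (hypothetical helper)."""
    with open(src, encoding='utf-8') as f:
        converted = convert_lines(f)
    with open(dst, 'w', encoding='utf-8') as f:
        f.write('\n'.join(converted) + '\n')

# Example call (filenames are assumptions):
# convert_file('bookmarks.xml', 'pinboard.html')
```

Writing to a separate output file rather than overwriting the export keeps the original available in case the import into Pinboard does not look right.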

Answer 2

script.awk

BEGIN{FS="#"}

/^<DT>/{
    if(d==1) print "<DT>"s # Previous entry had no <DD> line: print it unchanged
    s=substr($0,5);tags="" # Keep the line after "<DT>" until the tags are known
    d=1
}

/^<DD>/{
    d=0
    if(NF==1){print "<DT>"s; print; next} # No tags: keep both lines unchanged
    m=match(s,/>/) # End of the opening <A ...> tag: first match of ">"
    for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
    td=match(tags,/ /) # The description starts at the first space
    if(td==0){ # No description exists
        tags=substr(tags,2)
        tagdes=""
    }
    else{ # Description exists
        tagdes=substr(tags,td+1)
        tags=substr(tags,2,td-2)
    }
    print "<DT>" substr(s,1,m-1) " TAGS=\"" tags "\"" substr(s,m)
    print "<DD>" tagdes
}

END{
    if(d==1) print "<DT>"s # Last entry had no <DD> line
}

awk -f script.awk kippt > pinboard

INPUT

<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks">Phabricator</A>
<DD>#bug #tracking 

<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>

<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland

OUTPUT

<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks" TAGS="bug,tracking">Phabricator</A>
<DD>
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks" TAGS="iceland,tour,car,drive,self">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>Self-driving tour of Iceland
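
One caveat: because the script splits each <DD> line on "#", a "#" inside the description itself (for example "Learn C# today") is also treated as a separator. If your export contains such descriptions, one option is to consume only the #word tokens at the start of the line. A rough sketch of that idea in Python, where `split_dd` is a hypothetical helper name and not part of either answer:

```python
import re

# One or more "#word" tokens at the very start of the text,
# each optionally followed by whitespace.
LEADING_TAGS = re.compile(r'^(?:#\S+\s*)+')


def split_dd(dd_line):
    """Split a Kippt <DD> line into (tags, description). Hypothetical helper."""
    text = dd_line.strip()
    if text.startswith('<DD>'):
        text = text[4:]
    text = text.lstrip()
    match = LEADING_TAGS.match(text)
    if not match:
        # No leading tags: the whole line is the description.
        return [], text
    tags = [t[1:] for t in match.group(0).split()]
    return tags, text[match.end():]
```

With this approach, `split_dd('<DD>#csharp Learn C# today')` keeps "Learn C# today" as the description, since only tokens before the first non-tag word count as tags.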

Moving chunks of data in a file with awk
