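I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and, for some reason, it stores tags (prefixed with #) and the description in the same field. Pinboard keeps tags and description separate.
This is what a Kippt bookmark looks like after export: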
<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks">This is a title</A>
<DD>#tag1 #tag2 This is a description
This is what it needs to look like before importing into Pinboard:
<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks" TAGS="tag1,tag2">This is a title</A>
<DD>This is a description
Basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it up to the first line, inside the <A> tag.
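To pin down the transformation for a single entry, here is a minimal sketch in Python. The function name `move_tags` is made up for illustration, and the sketch assumes tags only appear as a leading run of #words at the start of the <DD> line. It handles one <DT>/<DD> pair only, not a whole file.

```python
import re

def move_tags(dt_line, dd_line):
    """Move leading #tags from a <DD> line into a TAGS attribute on the <DT> line.

    Returns (new_dt_line, new_dd_line); new_dd_line is None when the
    <DD> line contained nothing but tags.
    """
    body = dd_line[len("<DD>"):].strip()
    # Group 1: the leading run of "#word" tokens. Group 2: the description.
    match = re.match(r"((?:#\S+\s*)*)(.*)", body)
    tags = [t[1:] for t in match.group(1).split()]
    description = match.group(2).strip()
    if tags:
        # Add the attribute just before the '>' that closes the opening <A ...> tag.
        attr = ' TAGS="%s">' % ",".join(tags)
        dt_line = re.sub(r"(<A\b[^>]*)>", lambda m: m.group(1) + attr, dt_line, count=1)
    return dt_line, ("<DD>" + description if description else None)

dt = '<DT><A HREF="http://www.example.org/" ADD_DATE="1412337977" LIST="Bookmarks">This is a title</A>'
dd = "<DD>#tag1 #tag2 This is a description"
new_dt, new_dd = move_tags(dt, dd)
print(new_dt)
print(new_dd)
```

The replacement is passed as a function rather than a string so that a backslash inside a tag cannot be misread as a regex group reference.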
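I've been reading about moving chunks of data: sed or awk to move one chunk of text betwen first pattern pair into second pair?
So far I haven't come up with a good recipe. Any insights?
Edit: here is a real example of what the input file looks like (3 entries out of 3500):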
<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks">Phabricator</A>
<DD>#bug #tracking
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
Answer 0 (score: 0)
This may not be the most beautiful solution, but since this looks like a one-off task it should be good enough.
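import re

def flush(dt_line, dd_line):
    # Print the pending entry, skipping whichever part is empty.
    for out in (dt_line, dd_line):
        if out:
            print(out)

current_dt = current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('<DT>'):
            flush(current_dt, current_dd)  # the previous entry is complete
            current_dt, current_dd = line, ""
        elif line.startswith('<DD>') and current_dt:
            words = line[4:].split()
            tags = [w[1:] for w in words if w.startswith('#')]
            rest = ' '.join(w for w in words if not w.startswith('#'))
            if tags:
                # Put TAGS="..." just before the '>' closing the opening <A ...> tag.
                attr = ' TAGS="' + ','.join(tags) + '">'
                current_dt = re.sub(r'(<A[^>]+)>', lambda m: m.group(1) + attr, current_dt, count=1)
            current_dd = '<DD>' + rest if rest else ""  # drop a <DD> that held only tags
        else:
            flush(current_dt, current_dd)
            current_dt = current_dd = ""
            print(line)  # pass any other line through unchanged
flush(current_dt, current_dd)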
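Let me know if any part of the code is unclear. You can of course have Python write the lines to a file instead of printing them, or even modify the original file in place.
Edit: added an if check so that empty <DD> lines don't show up in the result.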
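As a rough illustration of writing to a file instead of printing, the same idea can be wrapped in a generator and a small file helper. The names `convert` and `convert_file`, the output file name, and the UTF-8 encoding are all assumptions made for this sketch, not something the export format guarantees.

```python
import re

def convert(lines):
    """Yield converted lines: #tags move from each <DD> into its <DT>'s <A> tag."""
    pending_dt = None  # a <DT> line still waiting for a possible <DD>
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<DD>") and pending_dt is not None:
            words = line[4:].split()
            tags = [w[1:] for w in words if w.startswith("#")]
            rest = " ".join(w for w in words if not w.startswith("#"))
            if tags:
                attr = ' TAGS="%s">' % ",".join(tags)
                pending_dt = re.sub(r"(<A\b[^>]*)>", lambda m: m.group(1) + attr,
                                    pending_dt, count=1)
            yield pending_dt
            pending_dt = None
            if rest:  # skip a <DD> that held only tags
                yield "<DD>" + rest
            continue
        if pending_dt is not None:
            yield pending_dt  # this entry had no <DD> line
            pending_dt = None
        if line.startswith("<DT>"):
            pending_dt = line
        else:
            yield line  # pass everything else through unchanged
    if pending_dt is not None:
        yield pending_dt

def convert_file(src_path, dst_path):
    # Assumption: the export is UTF-8 encoded.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for out in convert(src):
            dst.write(out + "\n")
```

Calling `convert_file("bookmarks.xml", "pinboard.html")` would then write the converted bookmarks to a new file and leave the original untouched, which is probably safer than editing it in place.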
Answer 1 (score: 0)
script.awk
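BEGIN { FS = "#" }                      # split on "#": fields 2..NF of a <DD> line are the tags
/^<DT>/ {
    if (d == 1) print "<DT>" s          # previous entry had no <DD> line: print it unchanged
    s = substr($0, 5); tags = ""        # keep everything after "<DT>" until the <DD> line is seen
    d = 1
}
/^<DD>/ {
    d = 0
    if (NF == 1) { print "<DT>" s; print; next }   # no tags: pass both lines through unchanged
    m = match(s, />/)                   # end of the opening <A ...> tag: first ">"
    for (i = 2; i <= NF; i++) { sub(/ $/, "", $i); tags = tags "," $i }   # concatenate tags
    td = match(tags, / /)               # a space marks the start of the description
    if (td == 0) {                      # no description
        tags = substr(tags, 2)
        tagdes = ""
    } else {                            # description exists
        tagdes = substr(tags, td)
        tags = substr(tags, 2, td - 2)
    }
    # Note: this prints ", TAGS=..." with a comma, which the target format in the question does not have.
    print "<DT>" substr(s, 1, m - 1) ", TAGS=\"" tags "\"" substr(s, m)
    print "<DD>" tagdes
}
END { if (d == 1) print "<DT>" s }      # flush a final entry that has no <DD> line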
awk -f script.awk kippt > pinboard
Input:
<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks">Phabricator</A>
<DD>#bug #tracking
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
Output:
<DT><A HREF="http://phabricator.org/" ADD_DATE="1412973315" LIST="Bookmarks", TAGS="bug,tracking">Phabricator</A>
<DD>
<DT><A HREF="http://qz.com/261426/the-hidden-commands-for-diagnosing-and-improving-your-netflix-streaming-quality/" ADD_DATE="1412838293" LIST="Inbox">The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz</A>
<DT><A HREF="http://www.farmholidays.is/" ADD_DATE="1412337977" LIST="Bookmarks", TAGS="iceland,tour,car,drive,self">Icelandic Farm Holidays | Local experts in Iceland vacations</A>
<DD> Self-driving tour of Iceland
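The output above differs from the target format in the question in three small ways: there is a comma before TAGS, a <DD> line is printed even when it is empty, and the description keeps a leading space. Whether the importer tolerates these is not known here. If it does not, a minimal cleanup sketch in Python could look like this; the name `tidy` is made up for illustration.

```python
def tidy(lines):
    """Yield awk-output lines adjusted to the target format from the question."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<DT>"):
            # Drop the comma that the awk script prints before TAGS.
            line = line.replace('", TAGS="', '" TAGS="', 1)
        elif line.startswith("<DD>"):
            text = line[4:].strip()
            if not text:
                continue  # skip <DD> lines that carry no description
            line = "<DD>" + text
        yield line
```

It can be applied to the file the awk command produced, for example by iterating over `tidy(open("pinboard"))` and writing each line to a new file.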