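So the last programmer left me this script. It grabs all of the old content and writes each page out as a "partial" file, stripping out all of the container HTML and leaving only the HTML of the article itself. It then writes a manifest as JSON, which gives the URL, title, and creation date of every file.
I'd like all of the files ordered by creation date, both in the manifest and in the names of the generated files (something like [unix-create-date]-[url].partial). Then I could use the manifest file to list the files in creation order, and use the filenames themselves as well.
I don't know Python, so I don't know how to get the file modification date in there. Thanks for any replies!
Here is the full script.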
#!/usr/bin/python
import os
import re
from BeautifulSoup import BeautifulSoup
import simplejson as json

def parse_article(root, filename):
    path = os.path.join(root, filename)
    abs_path = os.path.abspath(path)
    try:
        article = open(abs_path, 'rU')
        html = article.read()
        article.close()
    except IOError:
        print "Cannot open article: %s" % path
    url = "/%s" % path
    soup = BeautifulSoup(html)
    title = None
    fallbacks = ['h1', 'h2', 'h3', 'title']
    for fallback in fallbacks:
        if title is None:
            title = soup.find(fallback)
        else:
            break
    content = u"" if soup.body is None else soup.body.renderContents()
    save_file(root, "%s.partial" % filename, content)
    title = u"" if title is None else title.renderContents()
    return unicode(url), title

def process_folder(path):
    files = os.listdir(path)
    articles = filter(lambda name: not name.startswith('index.') and (name.endswith('.html') or name.endswith('.htm')), files)
    manifest = {}
    for article in articles:
        url, title = parse_article(path, article)
        manifest[url] = title
    return manifest

def save_json(root, name, obj):
    if len(obj.keys()) == 0:
        return
    path = os.path.join(root, name)
    manifest = open(path, 'w')
    json.dump(obj, manifest)
    manifest.close()
    print "Wrote %s" % path

def save_file(root, name, content):
    path = os.path.join(root, name)
    manifest = open(path, 'w')
    manifest.write(content)
    manifest.close()
    print "Wrote %s" % path

def process(root):
    root = os.path.abspath(root)
    root_re = '^%s[/]*' % root
    for dirname, dirnames, filenames in os.walk(root):
        dirname = re.sub(root_re, '', dirname)
        if len(dirname) > 0:
            manifest = process_folder(dirname)
            abs_path = os.path.abspath(os.path.join(root, dirname))
            save_json(abs_path, "manifest.json", manifest)

if __name__ == "__main__":
    process('.')
Answer 0 (score: 0)
You can use os.stat:
>>> import os, time
>>> result = os.stat("/tmp/z.py")
>>> result
posix.stat_result(st_mode=33188, st_ino=6034492, st_dev=16777220L, st_nlink=1, st_uid=501, st_gid=0, st_size=189, st_atime=1424735724, st_mtime=1424735651, st_ctime=1424735651)
>>> print "Modification date: %s -> %s" % (result.st_mtime, time.ctime(result.st_mtime))
Modification date: 1424735651.0 -> Mon Feb 23 15:54:11 2015
>>> print "Creation date: %s -> %s" % (result.st_ctime, time.ctime(result.st_ctime))
Creation date: 1424735651.0 -> Mon Feb 23 15:54:11 2015
>>> print "Access date: %s -> %s" % (result.st_atime, time.ctime(result.st_atime))
Access date: 1424735724.0 -> Mon Feb 23 15:55:24 2015
So in your code, you would probably want to store it in the manifest here:
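...
    for article in articles:
        url, title = parse_article(path, article)
        # Stat the article file itself; `path` on its own is the folder.
        stat_result = os.stat(os.path.join(path, article))
        manifest[url] = {'title': title, 'ctime': stat_result.st_ctime}
...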
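For the [unix-create-date]-[url].partial naming mentioned in the question, a minimal sketch might look like the following. The helper name `timestamped_name` and its `suffix` parameter are hypothetical, not part of the original script, and it is written for Python 3 (the script above is Python 2, though the `os.stat` call is the same in both). It assumes the modification time (`st_mtime`) is the timestamp you want in the name.

```python
import os


def timestamped_name(root, filename, suffix=".partial"):
    """Return '<unix-mtime>-<filename><suffix>' for the file at root/filename.

    Hypothetical helper: the name and the suffix parameter are assumptions.
    """
    # Stat the article itself, not the folder that contains it.
    stat_result = os.stat(os.path.join(root, filename))
    # st_mtime is a float; truncate to whole seconds for a tidy filename.
    return "%d-%s%s" % (int(stat_result.st_mtime), filename, suffix)
```

In the script, the result could be passed to `save_file` in place of `"%s.partial" % filename`, so that a plain alphabetical listing of the folder is roughly chronological as long as the timestamps have the same number of digits.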
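With that information, you can sort the files by ctime, convert it to a datetime object, and so on. Note that on Unix st_ctime is the time of the last metadata change rather than the creation time; since you asked about the modification date, st_mtime is probably the value you want.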
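As a rough illustration of the sorting and conversion, here is a sketch that assumes each manifest entry is a dict holding a title and a timestamp under the key `ctime`. That shape, the sample data, and the function name `sorted_articles` are assumptions made for the example, and it is written for Python 3.

```python
from datetime import datetime


def sorted_articles(manifest):
    """Return (url, entry) pairs ordered oldest first by entry['ctime']."""
    return sorted(manifest.items(), key=lambda item: item[1]["ctime"])


# Made-up sample data in the assumed manifest shape.
manifest = {
    "/posts/b.html": {"title": "Second", "ctime": 1424735724.0},
    "/posts/a.html": {"title": "First", "ctime": 1424735651.0},
}

for url, entry in sorted_articles(manifest):
    # fromtimestamp() interprets the value in the local time zone.
    created = datetime.fromtimestamp(entry["ctime"])
    print("%s  %s  %s" % (created.isoformat(), url, entry["title"]))
```

A JSON object has no guaranteed order, so if the order needs to survive in manifest.json it may be safer to dump the sorted list of pairs rather than a dict.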