I want to modify this Python script to output the modification date of the parsed files along with their titles

Time: 2015-02-23 19:02:03

Tags: python beautifulsoup

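So the last programmer left me this script, which grabs all the old content and writes each page out as a ".partial" file, stripping all the container HTML and leaving only the HTML of the article itself. It then writes a manifest as JSON, which gives the URL, title, and creation date of all the files.

I would like all the files ordered by creation date, both in the manifest and in the names of the files that get created (something like [unix-create-date]-[url].partial). Then I could use the manifest file to list the files in creation order, and use the filenames themselves as well.

I don't know Python, so I have no idea how to get the file modification date in there. Thanks for your replies!

Here is the full script.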

#!/usr/bin/python

import os
import re
from BeautifulSoup import BeautifulSoup
import simplejson as json

def parse_article(root, filename):
    path = os.path.join(root, filename)
    abs_path = os.path.abspath(path)
    try:
        article = open(abs_path, 'rU')
        html = article.read()
        article.close()
    except IOError:
        print "Cannot open article: %s" % path

    url = "/%s" % path
    soup = BeautifulSoup(html)

    title = None
    fallbacks = ['h1', 'h2', 'h3', 'title']
    for fallback in fallbacks:
        if title is None:
            title = soup.find(fallback)
        else:
            break

    content = u"" if soup.body is None else soup.body.renderContents()
    save_file(root, "%s.partial" % filename, content)

    title = u"" if title is None else title.renderContents()
    return unicode(url), title

def process_folder(path):
    files = os.listdir(path)
    articles = filter(lambda name: not name.startswith('index.') and (name.endswith('.html') or name.endswith('.htm')), files)
    manifest = {}

    for article in articles:
        url, title = parse_article(path, article)
        manifest[url] = title

    return manifest

def save_json(root, name, obj):
    if len(obj.keys()) == 0:
        return

    path = os.path.join(root, name)
    manifest = open(path, 'w')
    json.dump(obj, manifest)
    manifest.close()
    print "Wrote %s" % path

def save_file(root, name, content):
    path = os.path.join(root, name)
    manifest = open(path, 'w')
    manifest.write(content)
    manifest.close()
    print "Wrote %s" % path

def process(root):
    root = os.path.abspath(root)
    root_re = '^%s[/]*' % root
    for dirname, dirnames, filenames in os.walk(root):
        dirname = re.sub(root_re, '', dirname)
        if len(dirname) > 0:
            manifest = process_folder(dirname)
            abs_path = os.path.abspath(os.path.join(root, dirname))
            save_json(abs_path, "manifest.json", manifest)


if __name__ == "__main__":
    process('.')

1 Answer:

Answer 0: (score: 0)

You can use os.stat to get a file's modification date:

>>> import os, time
>>> result = os.stat("/tmp/z.py")
>>> result
posix.stat_result(st_mode=33188, st_ino=6034492, st_dev=16777220L, st_nlink=1, st_uid=501, st_gid=0, st_size=189, st_atime=1424735724, st_mtime=1424735651, st_ctime=1424735651)
>>> print "Modification date: %s -> %s" % (result.st_mtime, time.ctime(result.st_mtime))
Modification date: 1424735651.0 -> Mon Feb 23 15:54:11 2015
>>> print "Creation date: %s -> %s" % (result.st_ctime, time.ctime(result.st_ctime))
Creation date: 1424735651.0 -> Mon Feb 23 15:54:11 2015
>>> print "Access date: %s -> %s" % (result.st_atime, time.ctime(result.st_atime))
Access date: 1424735724.0 -> Mon Feb 23 15:55:24 2015
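
The st_mtime value shown above could also feed the filename scheme the question describes. The following is only a rough sketch: the helper name partial_name is hypothetical, it uses the original filename as a stand-in for the [url] part, and the demo file and timestamp are made up for illustration.

```python
import os
import tempfile

def partial_name(root, filename):
    """Build '<unix-mtime>-<filename>.partial' for a file under root.

    Hypothetical helper: the original script simply appends '.partial'
    to the filename, without a timestamp prefix.
    """
    mtime = int(os.stat(os.path.join(root, filename)).st_mtime)
    return "%d-%s.partial" % (mtime, filename)

# Demo on a throwaway file whose modification time is set explicitly.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "article.html")
with open(sample_path, "w") as handle:
    handle.write("<html></html>")
os.utime(sample_path, (1424735651, 1424735651))  # (atime, mtime)

name = partial_name(tmp_dir, "article.html")
```

Because the prefix is a whole number of seconds since the epoch, sorting such names by that number also sorts the files by modification time.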

So, in your code, you probably want to store that timestamp in the manifest here:

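    ...
    for article in articles:
        url, title = parse_article(path, article)
        # stat the article file itself, not the folder that contains it
        ctime = os.stat(os.path.join(path, article)).st_ctime
        # keep the title and the timestamp together under the article's URL
        manifest[url] = {'title': title, 'ctime': ctime}
    ...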

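With that information you can sort the files by ctime, convert it to a datetime object, and so on. Be aware that on Unix-like systems st_ctime is the time of the last metadata change rather than the creation time; st_mtime is the modification date the question asks about.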
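
As a rough illustration of that last step, the sketch below sorts manifest entries by timestamp and converts one to a datetime. It assumes each entry is stored as a dict with 'title' and 'ctime' keys, which is not what the script does as posted; the URLs, titles, timestamps, and helper names are made-up sample values.

```python
import datetime

# Hypothetical manifest in the shape {url: {'title': ..., 'ctime': ...}}.
manifest = {
    "/news/b.html": {"title": "Second", "ctime": 1424735700.0},
    "/news/a.html": {"title": "First", "ctime": 1424735651.0},
}

def sorted_entries(manifest):
    """Return (url, entry) pairs ordered oldest first by timestamp."""
    return sorted(manifest.items(), key=lambda item: item[1]["ctime"])

def to_datetime(timestamp):
    """Convert a Unix timestamp to a datetime in the local timezone."""
    return datetime.datetime.fromtimestamp(timestamp)

ordered = sorted_entries(manifest)
oldest_url, oldest_entry = ordered[0]
oldest_when = to_datetime(oldest_entry["ctime"])
```

A plain dict written with json.dump does not promise any particular order to whoever reads the file, so dumping a sorted list of entries instead of the dict may be the safer choice if the consumer relies on ordering.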