Python Regex: looping over the first line of every file in a directory

Time: 2017-10-11 16:58:31

Tags: python regex loops

I want to loop through .txt files and use the date from the first line of each file (e.g. April 1, 1993).

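This code works, but it matches against the whole file rather than just the first line (note: the code shown below covers more than just the date-matching loop):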

The following script has been updated and works:

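import glob
import re
from shutil import move

from dateutil import parser

# Month name (full or abbreviated), day, comma, four-digit year, e.g. "April 1, 1993".
DATE_PATTERN = r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'

articles = glob.glob("*.txt")
y = 1

for f in articles:
    # Pass 1: scan every line for the "LENGTH:" field to get the word count.
    with open(f, "r") as content:
        wordcount = "x"  # sentinel: no LENGTH line found
        lines = content.readlines()
        for line in lines:
            if line[0:7] == "LENGTH:":
                lineclean = re.sub(r'[#%&\<>*?:/{}$@+|=]', '', line)
                wordcount = lineclean[7:13]
                # Cut the slice just before the "w" of "words".
                if wordcount[5] == "w":
                    wordcount = wordcount[0:4]
                elif wordcount[4] == "w":
                    wordcount = wordcount[0:3]
                elif wordcount[3] == "w":
                    wordcount = wordcount[0:2]
                elif wordcount[2] == "w":
                    wordcount = wordcount[0:1]
    # Pass 2: read only the first line and search for the date there.
    with open(f, "r") as content:
        first_line = content.readline()
    match = re.search(DATE_PATTERN, first_line)
    # Skip the file when either piece is missing, so a value left over
    # from the previous file is never reused.
    if match is not None and wordcount != "x":
        parsed_pubdate = parser.parse(match.group()).strftime('%Y-%m-%d')
        try:
            # `source` is defined in the part of the script not shown here.
            move(f, "{parsed_pubdate}_{wordcount}_{source}.txt".format(**locals()))
        except OSError:
            pass
    y += 1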
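
The chain of slices on the LENGTH line can probably be collapsed into one regex. This is only a sketch: it assumes the line looks like "LENGTH: 1234 words", which is inferred from the slicing above and not confirmed anywhere, and `wordcount_from_lines` is a made-up helper name.

```python
import re

# Assumed line format: "LENGTH: 1234 words" (inferred, not confirmed).
LENGTH_RE = re.compile(r'^LENGTH:\s*(\d+)\s*w')


def wordcount_from_lines(lines):
    """Return the digits of the last LENGTH line as a string, or None."""
    wordcount = None
    for line in lines:
        m = LENGTH_RE.match(line)
        if m:
            wordcount = m.group(1)
    return wordcount


print(wordcount_from_lines(["Some headline\n", "LENGTH: 1234 words\n"]))
```

Unlike the slicing version, this is not limited to four digits, and it returns None instead of the "x" sentinel when no LENGTH line is present.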

To match the date only on the first line of the file, I added ^\s and flags=re.MULTILINE, which gave me:

match = re.search('^\s(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}', line, flags=re.MULTILINE).group()

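However, the program now picks up only one date (the date of the last file in the folder) and uses it for every file, so every file gets the same date even though the dates in the original .txt files differ.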
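
For what it's worth, `re.MULTILINE` makes `^` match at the start of every line, not just the first, so it does not limit the search to the first line on its own. Also, `^\s` requires one whitespace character before the month. A small sketch with made-up file content shows both effects; the names `MONTHS`, `DATE` and `text` are only for illustration.

```python
import re

MONTHS = (r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|'
          r'Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)')
DATE = MONTHS + r'\s+\d{1,2},\s+\d{4}'

# Made-up content: the date sits on the second line, not the first.
text = "Headline without a date\nApril 1, 1993\nBody text\n"

# With re.MULTILINE, ^ anchors at the start of each line, so line 2 matches.
multiline_hit = re.search(r'^\s*' + DATE, text, flags=re.MULTILINE)

# ^\s (no quantifier) demands a whitespace character before the month,
# so a line that begins directly with the date is rejected.
strict_hit = re.search(r'^\s' + DATE, "April 1, 1993\n", flags=re.MULTILINE)

# Handing the regex only the first line is what restricts the search.
first_line = text.split("\n", 1)[0]
first_line_hit = re.search(DATE, first_line)

print(multiline_hit.group())
print(strict_hit)
print(first_line_hit)
```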

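I have left out the larger step that this loop belongs to, but my question only concerns the regex date-matching loop. Thanks in advance for your help!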
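
One way a date can leak from one file to the next in a loop like this, offered as a possible cause and not a diagnosis: when the search fails and the exception is swallowed, the variable still holds the value from an earlier iteration. The sketch below uses made-up first lines in place of real files.

```python
import re

DATE_RE = re.compile(
    r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|'
    r'Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'
)

# Made-up first lines standing in for three files.
first_lines = ["April 1, 1993", "No date on this line", "May 5, 1994"]

# Swallowing the failure: `match` keeps the previous iteration's value.
stale = []
for first_line in first_lines:
    try:
        match = DATE_RE.search(first_line).group()
    except AttributeError:
        pass
    stale.append(match)

# Checking the result explicitly: a miss is recorded as None.
checked = []
for first_line in first_lines:
    m = DATE_RE.search(first_line)
    checked.append(m.group() if m else None)

print(stale)
print(checked)
```

In the first loop the second entry silently repeats the date of the first; in the second loop the miss is visible and the file can be skipped.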

1 Answer:

Answer 0: (score: 0)

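import glob
import re
from shutil import move

from dateutil import parser

# Month name (full or abbreviated), day, comma, four-digit year, e.g. "April 1, 1993".
DATE_PATTERN = r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'

articles = glob.glob("*.txt")
y = 1

for f in articles:
    # Pass 1: scan every line for the "LENGTH:" field to get the word count.
    with open(f, "r") as content:
        wordcount = "x"  # sentinel: no LENGTH line found
        lines = content.readlines()
        for line in lines:
            if line[0:7] == "LENGTH:":
                lineclean = re.sub(r'[#%&\<>*?:/{}$@+|=]', '', line)
                wordcount = lineclean[7:13]
                # Cut the slice just before the "w" of "words".
                if wordcount[5] == "w":
                    wordcount = wordcount[0:4]
                elif wordcount[4] == "w":
                    wordcount = wordcount[0:3]
                elif wordcount[3] == "w":
                    wordcount = wordcount[0:2]
                elif wordcount[2] == "w":
                    wordcount = wordcount[0:1]
    # Pass 2: read only the first line and search for the date there.
    with open(f, "r") as content:
        first_line = content.readline()
    match = re.search(DATE_PATTERN, first_line)
    # Skip the file when either piece is missing, so a value left over
    # from the previous file is never reused.
    if match is not None and wordcount != "x":
        parsed_pubdate = parser.parse(match.group()).strftime('%Y-%m-%d')
        try:
            # `source` is defined in the part of the script not shown here.
            move(f, "{parsed_pubdate}_{wordcount}_{source}.txt".format(**locals()))
        except OSError:
            pass
    y += 1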
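
If pulling in dateutil is not an option, the first-line date step can be sketched with the standard library alone. This swaps `parser.parse` for a manual month lookup, so it is an alternative and not what the script above does; `pubdate_from_first_line`, `pubdate_of_file` and `MONTH_ABBR` are hypothetical names.

```python
import re
from datetime import date

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Same date shape as above, with named groups for month, day and year.
DATE_RE = re.compile(
    r'(?P<month>Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
    r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
    r'\s+(?P<day>\d{1,2}),\s+(?P<year>\d{4})'
)


def pubdate_from_first_line(first_line):
    """Return the date found in first_line as 'YYYY-MM-DD', or None."""
    m = DATE_RE.search(first_line)
    if m is None:
        return None
    # The first three letters identify the month for both spellings.
    month = MONTH_ABBR.index(m.group("month")[:3]) + 1
    try:
        return date(int(m.group("year")), month, int(m.group("day"))).isoformat()
    except ValueError:
        # The pattern matched, but the day does not exist (e.g. Feb 31).
        return None


def pubdate_of_file(path):
    """Read only the first line of the file and extract its date."""
    with open(path, "r") as handle:
        return pubdate_from_first_line(handle.readline())


print(pubdate_from_first_line("April 1, 1993\n"))
```

Because the function returns None on a miss, the caller can skip the file explicitly and no date is carried over from the previous iteration.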