Question

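Here is my problem. I have a complex text file containing 32 articles. Since each article begins with "1 of 32 DOCUMENTS", "2 of 32 DOCUMENTS", and so on, I split the text into separate articles with the following code: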

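import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        # A header such as "1 of 32 DOCUMENTS" starts a new article.
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            if current:
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
# Keep the last article, which has no header after it.
if current:
    sections.append("".join(current))

print(len(sections))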

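I am now trying to extract the date of each article. I noticed that the date is on the 4th or 5th line at the beginning of each article, so I managed to write a function that pulls out the relevant lines: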

def main():
    for i in range(len(sections)):
        date_row4 = sections[i].split("\n")[4].split(" ")
        date_row5 = sections[i].split("\n")[5].split(" ")

        print(date_row4)
        print(date_row5)

This gives me the following lists:

What I now want to find is the month and the year, using only the following:

months = 'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'

month = re.findall(r' \w+ months',date_row4 or date_row5)
year = re.findall(r' \d^20', date_row4 or date_row5 )

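However, this does not work. I have only just started learning Python, so I can imagine there is a lot that could be going wrong. Any help would be greatly appreciated.

Kind regards,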

安德烈

Answer 1

I'm not sure I would use regular expressions here. The time module has tools for parsing dates.

>>> import time
>>> time.strptime('December 29, 2005 Thursday', "%B %d, %Y %A").tm_year
2005

If some lines are missing the day of the week, you can use a try/except block, trying the more common case first.
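A minimal sketch of that try/except idea, assuming the two layouts in the file are 'December 29, 2005 Thursday' and 'December 29, 2005'. The helper name parse_date and the FORMATS tuple are illustrative and not taken from the question, and %B and %A assume English month and weekday names in the current locale.

```python
import time

# Hypothetical list of layouts, most common first; adjust to the real file.
FORMATS = ("%B %d, %Y %A", "%B %d, %Y")

def parse_date(text):
    """Return a time.struct_time, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return time.strptime(text.strip(), fmt)
        except ValueError:
            continue  # wrong layout, try the next format
    return None

parsed = parse_date("December 29, 2005")
if parsed:
    print(parsed.tm_mon, parsed.tm_year)
```

Returning None instead of raising lets the caller try line 4 first and fall back to line 5.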

Answer 2

IIUC, your question really starts at "This gives me the following lists". (If so, why include everything leading up to that part, if I may ask?)

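While you can certainly use re to match your exact pattern very precisely, I often find it much easier to use only a small subset of its features. The following exp uses a very simple regular expression: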

exp = re.compile(r'(\w+) (\d+), (\d+)')

It specifies the general form you are looking for, and can be used as:

m = exp.search('December 29, 2005')
if m:
    m.groups()  # the matched groups: ('December', '29', '2005')

If you like, you can further check whether the matched month is in the months variable (if you choose to do so, I would change it to a set).
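One way the month check could look, as a sketch: MONTHS is the question's months tuple turned into a set, and month_and_year is a hypothetical helper name introduced only for illustration.

```python
import re

# The question's month names as a set, for fast membership tests.
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}
exp = re.compile(r'(\w+) (\d+), (\d+)')

def month_and_year(text):
    """Return (month, year) for the first match that starts with a real
    month name, or None if there is no such match."""
    for m in exp.finditer(text):
        month, _day, year = m.groups()
        if month in MONTHS:
            return month, year
    return None

print(month_and_year('December 29, 2005 Thursday'))
```

The set check rejects lines such as 'Section 12, 2005' that happen to fit the loose pattern.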

Answer 3

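Just trying to tidy up your regular expressions: the easier one to fix is the year. A regular expression has to list the characters in the order in which they appear: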

So, assuming all your years are after 2000, your expression would be '20\d\d'
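A short sketch of the year pattern in use. In the question's r' \d^20', the ^ is a start-of-string anchor, so it cannot match in the middle of a pattern. The \b word boundaries are an addition of this sketch, meant to avoid matching inside longer numbers, and the sample line is assumed.

```python
import re

line = "December 29, 2005 Thursday"  # assumed sample line

# Literal "20" first, then two digits, in the order they appear in the text.
year_match = re.search(r"\b20\d\d\b", line)
year = year_match.group() if year_match else None
print(year)
```

re.search returns None when nothing matches, so test the result before calling .group().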

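Now for the months. Unfortunately, what you are doing does not work: you cannot simply use a list inside a regular expression. It is easy to fix, though: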

months = ['January', 'February']  # etc.
pattern = '|'.join(months)  # builds a string such as: January|February
# re.search needs a string, so join the split words back together first
match = re.search(pattern, ' '.join(date_row4 + date_row5))
month = match.group() if match else None  # a string instead of a list

That said, there is a better way: use the datetime module.
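A minimal sketch of the datetime route, assuming the date appears as 'December 29, 2005' somewhere in the line. DATE_RE and extract_date are hypothetical names, and %B assumes English month names in the current locale.

```python
import re
from datetime import datetime

# Loose pattern to cut the date out of the line before parsing it.
DATE_RE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

def extract_date(line):
    """Return a datetime for the first date-like text in line, else None."""
    m = DATE_RE.search(line)
    if not m:
        return None
    try:
        return datetime.strptime(m.group(), "%B %d, %Y")
    except ValueError:
        return None  # e.g. the first word was not a month name

d = extract_date("December 29, 2005 Thursday")
if d:
    print(d.strftime("%B"), d.year)
```

Letting strptime validate the month name means no separate list of months is needed.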

Answer 4

import re

# Same month-day-year pattern for both candidate lines.
date_re = re.compile(r'(\w+)\s+\d{1,2},\s+(\d{4})')

for section in sections:
    lines = section.split("\n")
    # Keep each line as a string: re.search cannot search a list of words.
    date_row4 = lines[4]
    date_row5 = lines[5]

    match = date_re.search(date_row4) or date_re.search(date_row5)
    if match:
        month = match.group(1)
        year = match.group(2)
        print(month, year)

Update: it would be much better to use date formats, though.

Regular expressions for searching for a month and year in Python
