Question

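I am using a regular expression to match dates in an irregular dataset. Some dates may be missing the month or the day, and some give the year with 2 digits while others use 4. I have defined a regex that works well, but on many rows it produces multiple matches. I would like to define a lambda that selects the best match, but I am not sure how to do that with Python / Pandas.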

Here is my regex (I define it in pieces to make it easier to understand and reuse):

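# Raw strings keep \d from being treated as a string escape.
# Each piece is a self-contained, balanced named group.
year = r"(?P<year>(?:20|19)?\d{2})"
day = r"(?P<day>[0123]?\d)"
month = r"(?P<month>[01]?\d)"
separator = r"[ /-]"

# Month and day are each optional; the year is always required.
regex = f"(?:{month}{separator})?(?:{day}{separator})?{year}"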
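To illustrate how the assembled pattern behaves on a single string, here is a minimal sketch using the standard `re` module. The pattern is restated with each piece balanced on its own, and the `sample` string is an assumption chosen for illustration, not a line quoted from the data file.

```python
import re

# Same pieces as in the question, written as raw strings.
year = r"(?P<year>(?:20|19)?\d{2})"
day = r"(?P<day>[0123]?\d)"
month = r"(?P<month>[01]?\d)"
separator = r"[ /-]"
regex = f"(?:{month}{separator})?(?:{day}{separator})?{year}"

pattern = re.compile(regex)

# Illustrative input (assumed): one real date plus a stray two-digit number.
sample = "7/6/91 SOS-10 Total Score:"

# finditer yields every non-overlapping match, so the stray "10"
# is reported as a year-only match alongside the full date.
found = [m.groupdict() for m in pattern.finditer(sample)]
for groups in found:
    print(groups)
```

Because month and day are optional, any two-digit number can match as a bare year, which is where the extra matches come from.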

I read the data from a file and create a DataFrame with a single column, "original_text":

import pandas as pd

doc = []  # one entry per line of the file
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.DataFrame(doc, columns=['original_text'])

Here I am looking at part of the results matched by the regex:

print(df['original_text'].str.extractall(regex)[124:142])

         month  day  year
   match
67 0         7    6    91
   1       NaN  NaN    10
68 0         1   21    87
69 0       NaN  NaN    24
   1        11    3  1985
70 0         7   04    82
   1       NaN  NaN    90
   2       NaN  NaN    80
71 0         4   13    89
72 0       NaN  NaN    25
   1         7   11    77
   2       NaN  NaN    50
   3       NaN  NaN    28
   4       NaN  NaN    16
   5       NaN  NaN    83
   6        12  NaN    36
   7       NaN  NaN    20
   8         9   36    30

In many cases, extractall returns multiple matches. For example, on row 67 the text is:

7/6/91 SOS-10 Total Score:

This row has two matches. The first is my preferred match. In other cases the preferred match is not the first, for example row 69:

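24 yo right handed woman with history of large right frontal mass s/p resection 11/3/1985 who had recent urgent R cranial wound revision and placement of L EVD for declining vision...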

Row 72:

Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:

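I want to apply a function to each group of matches that picks the one most likely to form a valid date and returns only that one. The final result would look like this: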

   month  day  year
67     7    6    91
68     1   21    87
69    11    3  1985
70     7   04    82
71     4   13    89
72     7   11    77

Here is pseudocode for what I am trying to accomplish:

def get_best_match(matches):
    match = <best match from matches>
    return match

dates = df['original_text'].str.extractall(regex).<lambda that calls get_best_match on each row of this set>
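
One possible way to fill in the pseudocode above, offered as a rough sketch rather than a confirmed solution: score each match by how many of its month and day fields fall in a plausible range, then keep the highest-scoring match for each original row. The scoring rule, the sample strings, and the names `score` and `best` are all assumptions made for illustration; ties go to the earliest match.

```python
import pandas as pd

year = r"(?P<year>(?:20|19)?\d{2})"
day = r"(?P<day>[0123]?\d)"
month = r"(?P<month>[01]?\d)"
separator = r"[ /-]"
regex = f"(?:{month}{separator})?(?:{day}{separator})?{year}"

# Illustrative rows (assumed), standing in for the contents of dates.txt.
df = pd.DataFrame({"original_text": [
    "7/6/91 SOS-10 Total Score:",
    "24 yo woman s/p resection 11/3/1985",
    "CBC: 4.9/36/308",
]})

# One row per match, indexed by (original row, match number).
matches = df["original_text"].str.extractall(regex)

# Assumed heuristic: one point for a month in 1..12, one for a day in 1..31.
# Missing fields become NaN after the cast and therefore score zero.
month_ok = matches["month"].astype(float).between(1, 12)
day_ok = matches["day"].astype(float).between(1, 31)
score = month_ok.astype(int) + day_ok.astype(int)

# For each original row, idxmax gives the (row, match) label with the top score;
# on ties it returns the first one encountered.
best = score.groupby(level=0).idxmax()

# Select those matches and drop the match-number level of the index.
dates = matches.loc[best.tolist()].droplevel("match")
print(dates)
```

A stricter heuristic could also try to build a real date from the three fields and reject combinations that do not exist, at the cost of deciding how to treat two-digit years.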

Python Pandas - select best match from the set returned by extractall

0 answers: