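Finding missing dates

Time: 2019-02-22 01:57:40

Tags: python time-series

I am trying to write a function that finds missing dates in a DataFrame.

Here is my situation: (The data is sorted by customer, then by date. The date format is M/D/Y.)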


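The function should read "From Date" and "To Date" and check whether the dates are continuous for each customer. It should then add a column ("Results") and show the outcome there.

The function has to iterate over each customer.

(Comments added)

Please see my expected output. I am also adding the index and some explanation: index [1] shows Missing because continuity is broken. You can see this by comparing To Date [0] with From Date [2], which are not the same. On the other hand, To Date [2] equals From Date [4], which is why "Results" shows Not Missing at [3].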

From Date   To Date
Customer        
A   1/10/2017   2/9/2017
A   NaN         NaN
A   3/10/2017   4/9/2017
A   NaN         NaN
A   4/9/2017    5/9/2017
B   2/10/2017   3/9/2017
B   NaN         NaN
B   3/9/2017    4/9/2017

Any help would be greatly appreciated.

1 answer:

Answer 0: (score: 0)

Use pd.DataFrame.groupby together with pd.to_datetime:

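import pandas as pd

# Parse the M/D/Y strings into datetimes so they can be compared.
df['From Date'] = pd.to_datetime(df['From Date'], format="%m/%d/%Y")
df['To Date'] = pd.to_datetime(df['To Date'], format="%m/%d/%Y")

dfs = []
for k, d in df.groupby('Customer'):
    # Previous period's To Date, aligned with each later non-empty row.
    dt = d.dropna()['To Date'].shift(1)[1:]
    res = []
    for i in range(dt.shape[0]):
        if (d['From Date'][dt.index] == dt).iloc[i]:
            res.append('Not Missing')
        else:
            res.append('Missing')
    # Cast to object so the datetime Series can hold the text labels.
    dt = dt.astype(object)
    for i in range(dt.shape[0]):
        dt.iloc[i] = res[i]
    # Move each label up onto the empty row between the two periods.
    dt.index -= 1
    dfs.append(pd.concat([d, dt], axis=1))
result = pd.concat(dfs)
print(result)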

  Customer  From Date    To Date      To Date
0        A 2017-01-10 2017-02-09          NaN
1        A        NaT        NaT      Missing
2        A 2017-03-10 2017-04-09          NaN
3        A        NaT        NaT  Not Missing
4        A 2017-04-09 2017-05-09          NaN
5        B 2017-02-10 2017-03-09          NaN
6        B        NaT        NaT  Not Missing
7        B 2017-03-09 2017-04-09          NaN

Finally:

result.columns = ['Customer', 'From Date', 'To Date', 'Results']
print(result)

  Customer  From Date    To Date      Results
0        A 2017-01-10 2017-02-09          NaN
1        A        NaT        NaT      Missing
2        A 2017-03-10 2017-04-09          NaN
3        A        NaT        NaT  Not Missing
4        A 2017-04-09 2017-05-09          NaN
5        B 2017-02-10 2017-03-09          NaN
6        B        NaT        NaT  Not Missing
7        B 2017-03-09 2017-04-09          NaN

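Explanation:

  • pd.to_datetime: converts your date-like strings into actual datetime data, so that pandas can do computations on them (for example, the diff between two days). Since it is a Series operation, it has to be applied to each column you need rather than to the whole dataframe.
  • df.groupby: groupby returns a dict-like object keyed by the given condition; iterating over it yields each key together with its subset of rows. Since the whole computation is done per Customer, use df.groupby('Customer').
  • dt = d.dropna()['To Date'].shift(1)[1:]: d is the subset dataframe containing the data of a single Customer. shift(1) shifts the values down by one cell, so each row's From Date lines up with the previous row's To Date. This makes the comparison between To Date and From Date easier.
  • d['From Date'][dt.index] == dt: gives the boolean result of the comparison between To Date and From Date.
  • dt.iloc[i] = res[i]: once you have the list of Missing and Not Missing messages, assign it back to dt to build the Results column.
  • dfs.append(pd.concat([d, dt], axis=1)): concatenates the newly created Results column with the original d, then appends the result to the list.
  • result = pd.concat(dfs): dfs now contains the subset dataframe of each Customer. Concatenate them into one big dataframe.
  • result.columns = ['Customer', 'From Date', 'To Date', 'Results']: reassigns the column names.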
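
For comparison, the same check can be sketched without the explicit loops, by shifting within each group. This is not the code from the answer above but a possible alternative, under some assumptions: the sample frame is reconstructed from the printed output, the function name flag_gaps is made up for illustration, and the data is assumed to keep the layout shown in the question (one empty row between two consecutive periods of the same customer).

```python
import numpy as np
import pandas as pd


def flag_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Label each empty row that sits between two periods of one customer.

    flag_gaps is a hypothetical helper name, not part of pandas.
    """
    out = df.copy()
    out["From Date"] = pd.to_datetime(out["From Date"], format="%m/%d/%Y")
    out["To Date"] = pd.to_datetime(out["To Date"], format="%m/%d/%Y")

    grouped = out.groupby("Customer")
    prev_to = grouped["To Date"].shift(1)       # To Date of the row above
    next_from = grouped["From Date"].shift(-1)  # From Date of the row below

    # An empty row with a filled neighbour on both sides (same customer).
    is_separator = out["From Date"].isna() & prev_to.notna() & next_from.notna()

    # Continuous when the previous period ends where the next one starts.
    label = np.where(prev_to == next_from, "Not Missing", "Missing")
    out["Results"] = pd.Series(label, index=out.index).where(is_separator)
    return out


# Sample data reconstructed from the output printed in the answer.
df = pd.DataFrame({
    "Customer": list("AAAAABBB"),
    "From Date": ["1/10/2017", np.nan, "3/10/2017", np.nan,
                  "4/9/2017", "2/10/2017", np.nan, "3/9/2017"],
    "To Date": ["2/9/2017", np.nan, "4/9/2017", np.nan,
                "5/9/2017", "3/9/2017", np.nan, "4/9/2017"],
})

result = flag_gaps(df)
print(result)
```

Shifting on the grouped columns keeps comparisons from crossing customer boundaries, so the last row of A is never compared with the first row of B. If the real data does not use empty separator rows, the is_separator mask would need to be adapted.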