Pandas DataFrame: getting unique values from a timestamp column

Date: 2019-02-20 20:06:50

Tags: python pandas dataframe

I have the following time series data:

1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022

What I want is a list of years:

years = ['1998', '1999', '2000', '2001']

That way I can use the list to know which years can be queried in that DataFrame. Not all DataFrames contain the same years.

data = pd.read_csv(str(inFileName), index_col=0, parse_dates=True, header=None)
#data.iloc[:, 0]
print(pd.DatetimeIndex(data.iloc[:, 0]).year)
#print(data.iloc[:, 0])
#years = list(data.index)
#print(years)
for x in years:

I have tried many things without success. Can someone explain how to solve a problem like this?

Edit 1: Based on some suggestions, I am now doing this:

data = pd.read_csv(str(inFileName), parse_dates=[0], header=None)
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
print(year_list)
for x in year_list:
    newDF = data[x]
    newDF.head()
    print(newDF.head(5))

This gives me the list:

[2017, 2018, 2019]

But I cannot create a new DataFrame from the list. I want to create a new DataFrame for each value in the list. I get this error:

[2017, 2018, 2019]

Traceback (most recent call last):
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 2017

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./massageSM.py", line 123, in <module>
    main(sys.argv[1:])
  File "./massageSM.py", line 33, in main
    newDF = data[x]
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/home/jason/Applications/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 2017

Edit 2

I am using this:

data = pd.read_csv("RHE.SM", parse_dates=[0], header=None)
data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()
print(year_list)

for x in year_list:
    df = pd.DataFrame({'years':year_list})

    print(df.head(5))

which produces the output:

[2017, 2018, 2019]
   years
0   2017
1   2018
2   2019
   years
0   2017
1   2018
2   2019
   years
0   2017
1   2018
2   2019

But what I want to create is:

a DataFrame with only 2017
a DataFrame with only 2018
a DataFrame with only 2019

However, I cannot hard-code this, because other files will not contain the same years. I need to list the available years and iterate over them.

Edit 3:

I also tried:

data = pd.read_csv("RHE.SM", header=None, parse_dates=[0])
year_list = data[0].dt.year.unique().tolist()
print(year_list)
data.index = pd.DatetimeIndex(data[0])
print(type(data.index))
print(data.index)

for x in year_list:
    print(x)
    newDF = data[x]
    #newDF.head()

    #print(newDF.head(5))

The output looks fine at first, but an error occurs when newDF is created.

4 answers:

Answer 0 (score: 2)

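I haven't tested this, but I think it will work for you.

data.iloc[:, 0] = pd.to_datetime(data.iloc[:, 0])
data['year'] = data.iloc[:, 0].apply(lambda x: x.year)
year_list = data['year'].unique().tolist()

It first converts the first column to datetime format. It then creates a new column containing only the year component of each datetime. Finally, it outputs a list of the unique values in that column.

If you also want to convert the resulting list into a new DataFrame, just add this line:

df = pd.DataFrame({'years':year_list})

Edit: it is also possible to turn each item in the list into its own DataFrame.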
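As for the KeyError in the question: `data[x]` looks up a column labelled `x`, not the rows for that year. A minimal sketch of one way to get one DataFrame per year is to filter rows with a boolean mask on the `year` column. The sample rows are copied from the question, and `frames_by_year` is a hypothetical name chosen for illustration; here the year is extracted with the `.dt` accessor rather than `apply`.

```python
import io
import pandas as pd

# Sample rows copied from the question; a real script would read a file instead.
csv_text = """1998-01-02 09:30:00,0.4298,0.4337,0.4258,0.4317,6426369
1999-01-02 09:45:00,0.4317,0.4337,0.4258,0.4298,10589080
2000-01-02 10:00:00,0.4298,0.4337,0.4278,0.4337,9507980
2001-01-02 10:15:00,0.4337,0.4416,0.4298,0.4416,13639022
"""
data = pd.read_csv(io.StringIO(csv_text), header=None, parse_dates=[0])

# Year component of the timestamp column, then the distinct years.
data["year"] = data[0].dt.year
year_list = data["year"].unique().tolist()

# data[x] would look for a *column* named x (hence the KeyError);
# a boolean mask selects the *rows* whose year equals x.
# frames_by_year is a hypothetical name: {year: DataFrame of that year's rows}.
frames_by_year = {x: data[data["year"] == x] for x in year_list}

for x, new_df in frames_by_year.items():
    print(x, len(new_df))
```

Because the years come from the data itself, nothing is hard-coded and the same loop works for files that cover different years.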

Answer 1 (score: 1)

If you want to split one DataFrame into several separate DataFrames by year, you can do the following:

dfs = {
    year: sub_df.drop(columns=["year"])
    for year, sub_df in data.assign(year=lambda df: df[0].dt.year)\
                            .groupby("year")
}

Out:

{1998:                     0       1       2       3       4        5
 0 1998-01-02 09:30:00  0.4298  0.4337  0.4258  0.4317  6426369,
 1999:                     0       1       2       3       4         5
 1 1999-01-02 09:45:00  0.4317  0.4337  0.4258  0.4298  10589080,
 2000:                     0       1       2       3       4        5
 2 2000-01-02 10:00:00  0.4298  0.4337  0.4278  0.4337  9507980,
 2001:                     0       1       2       3       4         5
 3 2001-01-02 10:15:00  0.4337  0.4416  0.4298  0.4416  13639022}

If you want to iterate over the separate DataFrames and write each one to its own CSV, you can do the following:

for year, df in dfs.items():
    filename = "base_name_{}.csv".format(year)
    df.to_csv(filename, index=False)

Ideally, base_name would be derived from the base name of the original input file.
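A small sketch of deriving the output name from the input file, assuming the standard-library `pathlib` module. The helper `yearly_csv_name` and the `<stem>_<year>.csv` pattern are hypothetical choices for illustration, not something the answer prescribes.

```python
from pathlib import Path


def yearly_csv_name(in_file_name, year):
    """Build an output file name from the input file's stem and a year.

    Hypothetical naming scheme: 'RHE.SM' and 2017 give 'RHE_2017.csv'.
    """
    stem = Path(in_file_name).stem  # file name without directory or suffix
    return "{}_{}.csv".format(stem, year)


print(yearly_csv_name("RHE.SM", 2017))
```

Inside the loop above, `filename = yearly_csv_name(inFileName, year)` would replace the fixed "base_name" string.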

Answer 2 (score: 0)

The simplest case is:

data = pd.read_csv(inFileName, header=None, parse_dates=[0])
data[0].dt.year.unique().tolist()

This uses the datetime accessor (.dt), which is fast and vectorized.
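To illustrate, the sketch below extracts the years from a small Series both ways. The timestamps are invented sample values, and the variable names are chosen for illustration only. Both approaches should give the same list; the `.dt` version avoids calling a Python function once per row.

```python
import pandas as pd

# Invented sample timestamps, standing in for column 0 of the CSV.
stamps = pd.Series(pd.to_datetime([
    "2017-03-01 09:30:00",
    "2017-03-01 09:45:00",
    "2018-01-02 10:00:00",
    "2019-01-02 10:15:00",
]))

# Vectorized: the .dt accessor exposes the year of every element at once.
years_dt = stamps.dt.year.unique().tolist()

# Row by row: apply() calls the lambda once per timestamp.
years_apply = stamps.apply(lambda ts: ts.year).unique().tolist()

print(years_dt)
print(years_apply)
```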

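Answer 3 (score: 0)

First, you need to make sure you are extracting the year from a datetime type. Assuming you know the name of the column where the dates are stored, do the following: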

df['datetime'] = pd.to_datetime(df['datetime'])
df['year'] = df['datetime'].apply(lambda x: x.year)

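If the dates are in the index, do the following:

df = df.reset_index()
df['datetime'] = pd.to_datetime(df['index'])
df['year'] = df['datetime'].apply(lambda x: x.year)

The first line moves the values out of the index into a column that is named 'index' by default (when the index has no name). The second converts that column to datetime format.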

Once this is done, you can extract the unique years:

years =  df['year'].unique().tolist()
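When the dates are already a DatetimeIndex, the years can also be read from the index directly, without adding helper columns. The sketch below assumes a frame built from the question's timestamps; the column name `close` is an assumption, since the original file has no header.

```python
import pandas as pd

# Timestamps and one value column taken from the question's sample rows;
# the column name "close" is an assumption for illustration.
df = pd.DataFrame(
    {"close": [0.4317, 0.4298, 0.4337, 0.4416]},
    index=pd.to_datetime([
        "1998-01-02 09:30:00",
        "1999-01-02 09:45:00",
        "2000-01-02 10:00:00",
        "2001-01-02 10:15:00",
    ]),
)

# DatetimeIndex exposes .year directly, so no extra columns are needed.
years = df.index.year.unique().tolist()
print(years)

# Rows for a single year, selected with a boolean mask on the index.
only_1999 = df[df.index.year == 1999]
print(only_1999)
```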