I want to find the start index and end index of every chunk of data in a dataset. The data looks like this:
index     A  wanted_column1  wanted_column2
2000/1/1  0                  0
2000/1/2  1  2000/1/2        1
2000/1/3  1                  1
2000/1/4  1                  1
2000/1/5  0                  0
2000/1/6  1  2000/1/6        2
2000/1/7  1                  2
2000/1/8  1                  2
2000/1/9  0                  0
As the data shows, index and A are the given columns; wanted_column1 and wanted_column2 are the ones I want.
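The idea is that the data contains several chunks of consecutive rows. I want to retrieve the start index of each chunk, and I want a counter that increases by one with every new chunk in the data.
I tried using shift(-1), but I could not tell the start index of a chunk apart from its end index.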
Answer 0 (score: 0)
Is this what you need?
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 2
Edit: performance comparison
gehbiszumeis's solution: 19.9 ms
My solution: 4.07 ms
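Answer 1 (score: 0)
Assuming your dataframe is df, you can find the indices where df['A'] != 1. The index before each of them is the last index of a chunk, and the index after is the first index of a chunk. Afterwards, you count the first indices found so far to get the number of chunks.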
import pandas as pd

# Read your data
df = pd.read_csv('my_txt.txt', sep=',')

df['wanted_column1'] = None    # create the placeholder columns up front
df['wanted_column2'] = None

# Rows where 'A' is not 1 separate the chunks (assumes the default 0..n-1 index)
gaps = df[df['A'] != 1].index.values

# First index of each chunk: the row right after a gap row, unless the gap row
# is the last row of the dataframe or the following row is itself a gap row
first = [x + 1 for x in gaps if x != len(df) - 1 and df.loc[x + 1, 'A'] == 1]

# Last index of each chunk: the row right before a gap row, unless the gap row
# is the first row of the dataframe or the preceding row is itself a gap row
last = [x - 1 for x in gaps if x != 0 and df.loc[x - 1, 'A'] == 1]

# Set the first index of each chunk at its corresponding position in the dataframe
df.loc[first, 'wanted_column1'] = df.loc[first, 'index']

# You can also set the last index of each chunk (you only mentioned this in the
# text, not in your expected result). Uncomment for the last indices.
# df.loc[last, 'wanted_column1'] = df.loc[last, 'index']

# Count the chunks started up to and including each row and store it in wanted_column2
for i in df.index:
    df.loc[i, 'wanted_column2'] = sum(df.loc[:i, 'wanted_column1'].notna())

# Reset the counter to 0 outside the chunks to match your expected result
df.loc[df['A'] != 1, 'wanted_column2'] = 0
This gives:
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 0
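This works for any length of df and any number of chunks in the data, as long as the data begins with a row where A is not 1 (a chunk starting at the very first row has no preceding row to detect it by).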
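For comparison, the same two columns can also be sketched without a Python loop, using shift() and cumsum(). This is only a minimal sketch and not the code from either answer: it assumes A contains only 0 and 1, the helper name mark_chunks and the extra chunk_end column are hypothetical, and the sample frame is rebuilt from the data in the question.

```python
import pandas as pd


def mark_chunks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mark chunk starts/ends and number the chunks."""
    out = df.copy()
    in_chunk = out['A'].eq(1)

    # A row starts a chunk if it is inside a chunk and the previous row is not.
    starts = in_chunk & ~in_chunk.shift(1, fill_value=False)
    # A row ends a chunk if it is inside a chunk and the next row is not.
    ends = in_chunk & ~in_chunk.shift(-1, fill_value=False)

    # Keep the 'index' value only on start rows; other rows become missing.
    out['wanted_column1'] = out['index'].where(starts)
    # Running count of starts, reset to 0 outside the chunks.
    out['wanted_column2'] = starts.cumsum().where(in_chunk, 0)
    # Assumed extra column (not in the question): the end marker of each chunk.
    out['chunk_end'] = out['index'].where(ends)
    return out


# Sample data rebuilt from the question
df = pd.DataFrame({
    'index': [f'2000/1/{day}' for day in range(1, 10)],
    'A': [0, 1, 1, 1, 0, 1, 1, 1, 0],
})
result = mark_chunks(df)
print(result)
```

Comparing each row with the previous one (shift(1)) marks the starts, while comparing with the next one (shift(-1)) marks the ends, which is the distinction the question was missing. Because the first and last rows are compared against a filled-in False, a chunk that begins on the first row or ends on the last row is still detected.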