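Python - Pandas indexing and selection

Time: 2017-06-18 18:01:45

Tags: python pandas numpy indexing

I am trying to get pandas to select the range of rows under "ClosePrice" from the structured csv below and store it in a DataFrame. The file contains many identifiers, but I only want to go through it for the identifiers in the list below. The number of rows is not always the same either.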

list = ['ABC0123', 'DEF0123']

>  Column 1  Column 2   Column 3    Column 4   Column 5   Column 6   Column 7
>   "Date"   20170101 "Identifier"   ABC0123
> "OpenPrice"   500     "Currency"      USD
> "ClosePrice"  550       "foo"         bar
>     foo       foo        foo          foo       foo       foo        foo          
>     foo       foo        foo          foo       foo       foo        foo      
>     foo       foo        foo          foo       foo       foo        foo
>   "Date"   20170101 "Identifier"   SOMEOTHER
>     ...
>     ...
>     ...
>   "Date"   20170101 "Identifier"   DEF0123
> "OpenPrice"  600     "Currency"      USD
> "ClosePrice" 650       "foo"         bar
>    foo       foo        foo          foo       foo       foo        foo          
>    foo       foo        foo          foo       foo       foo        foo      
>    foo       foo        foo          foo       foo       foo        foo    
>    foo       foo        foo          foo       foo       foo        foo          
>    foo       foo        foo          foo       foo       foo        foo      
>    foo       foo        foo          foo       foo       foo        foo    
>    foo       foo        foo          foo       foo       foo        foo          
>    foo       foo        foo          foo       foo       foo        foo      
>    foo       foo        foo          foo       foo       foo        foo

Using a for-i loop, I have reached the first row of each table I am interested in, with:

df.iloc[df[df['Column 4'].isin(list)].index + 3,:]

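This lands on the top-left cell holding a "foo" value and selects the whole row, but I would like to work out how to select the rows below that starting point and stop before the next line of the form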

"Date"   20170101 "Identifier"   SOMEOTHER

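One approach I am considering is to check the len of the cell value in Column 5 just below the last row, which would be 0, but I have not been able to reproduce this logic in a script. Other approaches are very welcome.

1 answer:

Answer 0 (score: 1)

First, do not use list as a variable name, because it shadows the built-in list.

Create a helper column g that labels every group with a unique number using cumsum. Then get the numbers of all groups that contain a value from L, and select all rows of those groups with another isin: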
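A minimal sketch of why the name matters. The two function names are hypothetical and only serve to keep the shadowing confined to one scope:

```python
def shadowed():
    # Binding the name `list` hides the built-in type inside this function.
    list = ['ABC0123', 'DEF0123']
    try:
        return list('abc')          # tries to call the list object
    except TypeError as exc:
        return str(exc)


def not_shadowed():
    # A different variable name leaves the built-in usable.
    identifiers = ['ABC0123', 'DEF0123']
    return list('abc')


print(shadowed())
print(not_shadowed())
```

The first call reports that a list object is not callable, while the second still builds a list from the string.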

L = ['ABC0123', 'DEF0123']
df['g'] = df['Column 1'].eq('Date').cumsum()
vals = df.loc[df['Column 4'].isin(L), 'g']
df = df[df['g'].isin(vals)]
print (df)
      Column 1  Column 2    Column 3 Column 4 Column 5 Column 6 Column 7  g
0         Date  20170101  Identifier  ABC0123      NaN      NaN      NaN  1
1    OpenPrice       500    Currency      USD      NaN      NaN      NaN  1
2   ClosePrice       550         foo      bar      NaN      NaN      NaN  1
3          foo       foo         foo      foo      foo      foo      foo  1
4          foo       foo         foo      foo      foo      foo      foo  1
5          foo       foo         foo      foo      foo      foo      foo  1
9         Date  20170101  Identifier  DEF0123      NaN      NaN      NaN  3
10   OpenPrice       600    Currency      USD      NaN      NaN      NaN  3
11  ClosePrice       650         foo      bar      NaN      NaN      NaN  3
12         foo       foo         foo      foo      foo      foo      foo  3
13         foo       foo         foo      foo      foo      foo      foo  3

If necessary, remove the g column:

df = df.drop('g', axis=1)
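The helper-column idea can be tried on a small frame. The data below is a made-up miniature of the file (four columns only, one unwanted block in the middle), so the values are assumptions for illustration rather than the asker's real data:

```python
import pandas as pd

# Hypothetical miniature of the file: three blocks, each starting with a "Date" row.
rows = [
    ["Date", "20170101", "Identifier", "ABC0123"],
    ["OpenPrice", "500", "Currency", "USD"],
    ["ClosePrice", "550", "foo", "bar"],
    ["foo", "foo", "foo", "foo"],
    ["Date", "20170101", "Identifier", "SOMEOTHER"],
    ["OpenPrice", "1", "Currency", "USD"],
    ["ClosePrice", "2", "foo", "bar"],
    ["foo", "foo", "foo", "foo"],
    ["Date", "20170101", "Identifier", "DEF0123"],
    ["OpenPrice", "600", "Currency", "USD"],
    ["ClosePrice", "650", "foo", "bar"],
    ["foo", "foo", "foo", "foo"],
    ["foo", "foo", "foo", "foo"],
]
cols = ["Column 1", "Column 2", "Column 3", "Column 4"]
df = pd.DataFrame(rows, columns=cols)

L = ["ABC0123", "DEF0123"]

# eq("Date") is True on the first row of every block; cumsum turns that
# into a running block number: 1,1,1,1,2,2,2,2,3,3,3,3,3
g = df["Column 1"].eq("Date").cumsum()

# Block numbers whose identifier is wanted, then every row of those blocks.
wanted = g[df["Column 4"].isin(L)]
selected = df[g.isin(wanted)]
print(selected)
```

Because the block number is derived from the "Date" rows alone, the blocks may have any number of rows, which is the situation described in the question.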

A similar solution using the index:

L = ['ABC0123', 'DEF0123']
df.index = df['Column 1'].eq('Date').cumsum()
vals = df.index[df['Column 4'].isin(L)]
df = df.loc[vals].reset_index(drop=True)
print (df)
      Column 1  Column 2    Column 3 Column 4 Column 5 Column 6 Column 7
0         Date  20170101  Identifier  ABC0123      NaN      NaN      NaN
1    OpenPrice       500    Currency      USD      NaN      NaN      NaN
2   ClosePrice       550         foo      bar      NaN      NaN      NaN
3          foo       foo         foo      foo      foo      foo      foo
4          foo       foo         foo      foo      foo      foo      foo
5          foo       foo         foo      foo      foo      foo      foo
6         Date  20170101  Identifier  DEF0123      NaN      NaN      NaN
7    OpenPrice       600    Currency      USD      NaN      NaN      NaN
8   ClosePrice       650         foo      bar      NaN      NaN      NaN
9          foo       foo         foo      foo      foo      foo      foo
10         foo       foo         foo      foo      foo      foo      foo

Edit:

L1 = ['Date','OpenPrice','ClosePrice']
L = ['ABC0123', 'DEF0123']

#if necessary filter rows by L1 
df = df[df['Column 1'].isin(L1)]
df['g'] = df['Column 1'].eq('Date').cumsum()
vals = df.loc[df['Column 4'].isin(L), 'g']
df = df[df['g'].isin(vals)]
print (df)
      Column 1  Column 2    Column 3 Column 4 Column 5 Column 6 Column 7  g
0         Date  20170101  Identifier  ABC0123      NaN      NaN      NaN  1
1    OpenPrice       500    Currency      USD      NaN      NaN      NaN  1
2   ClosePrice       550         foo      bar      NaN      NaN      NaN  1
9         Date  20170101  Identifier  DEF0123      NaN      NaN      NaN  3
10   OpenPrice       600    Currency      USD      NaN      NaN      NaN  3
11  ClosePrice       650         foo      bar      NaN      NaN      NaN  3

To work with each group, you can use groupby together with a flexible apply:

def f(x):
    print (x)
    # some other code
    return x

df1 = df.groupby('g').apply(f)
print (df1)
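As one possible use of apply, the sketch below counts the data rows of each block. The sample frame, the name header_labels and the function count_data_rows are assumptions made for illustration. Grouping by an external Series and selecting a single column keeps the grouping key out of what apply receives:

```python
import pandas as pd

# Hypothetical two-block sample with only the columns needed here.
df = pd.DataFrame({
    "Column 1": ["Date", "OpenPrice", "ClosePrice", "foo", "foo",
                 "Date", "OpenPrice", "ClosePrice", "foo"],
    "Column 4": ["ABC0123", "USD", "bar", "foo", "foo",
                 "DEF0123", "USD", "bar", "foo"],
})
header_labels = ["Date", "OpenPrice", "ClosePrice"]

# Block number per row, kept as a separate Series instead of a column.
g = df["Column 1"].eq("Date").cumsum().rename("g")


def count_data_rows(first_col):
    # first_col is "Column 1" of one block; data rows are the non-header ones.
    return int((~first_col.isin(header_labels)).sum())


data_rows_per_block = df.groupby(g)["Column 1"].apply(count_data_rows)
print(data_rows_per_block)
```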

Edit:

Final code, using real data:

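L1 = ["Date", "OpenPrice", "ClosePrice"]
L = ['ABC0123', 'DEF0123']

# number the blocks: every "Date" row starts a new one
df['g'] = df['Column 1'].eq('Date').cumsum()
# block numbers whose identifier is in L
vals = df.loc[df['Column 4'].isin(L), 'g']
df = df[df['g'].isin(vals)]

for g in vals:
    # data rows of block g, without its Date/OpenPrice/ClosePrice rows
    dfFinal = df.loc[(df['g'] == g) & ~df['Column 1'].isin(L1)]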
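If each table needs to be kept rather than overwritten on every pass, one option is to collect the results in a dict keyed by identifier. This is a sketch under assumptions: split_tables is a hypothetical helper, the sample frame is invented, and the DataFrame is assumed to have a unique index:

```python
import pandas as pd


def split_tables(df, identifiers, header_labels=("Date", "OpenPrice", "ClosePrice")):
    """Return {identifier: data rows of its block} (hypothetical helper)."""
    is_date = df["Column 1"].eq("Date")
    g = is_date.cumsum()                       # block number for every row
    is_start = is_date & df["Column 4"].isin(identifiers)
    tables = {}
    for start in df.index[is_start]:
        block = df[g == g.loc[start]]          # all rows of this block
        data = block[~block["Column 1"].isin(list(header_labels))]
        tables[df.at[start, "Column 4"]] = data.reset_index(drop=True)
    return tables


# Invented sample: a wanted block, an unwanted one, then another wanted one.
sample = pd.DataFrame({
    "Column 1": ["Date", "OpenPrice", "ClosePrice", "foo", "foo",
                 "Date", "OpenPrice", "ClosePrice", "foo",
                 "Date", "OpenPrice", "ClosePrice", "foo", "foo", "foo"],
    "Column 4": ["ABC0123", "USD", "bar", "x", "y",
                 "SOMEOTHER", "USD", "bar", "z",
                 "DEF0123", "USD", "bar", "p", "q", "r"],
})

tables = split_tables(sample, ["ABC0123", "DEF0123"])
for identifier, table in tables.items():
    print(identifier, len(table))
```

Each value in the dict holds only the rows below "ClosePrice" for that identifier, however many there are.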