我想知道如何根据几个不同的类别找到估计值。其中两列是分类的,其中一列包含两个感兴趣的字符串,最后一列包含数值 我有一个名为sports.csv的csv文件
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
我正在尝试为price
找到一个同时具有棒球和篮球的Gym
以及enrollment
从240到260,因为它们来自region
4和type
1
Region Type enroll estimates price Gym
2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4 2 100 0.26 37 Baseball|Tennis
4 1 347 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1 1 286 0.74 78 Swimming|Basketball
0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0 1 263 0.91 31 Tennis
2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
0 1 109 0.12 17 Football|Hockey|Volleyball
我不知道如何将所有东西拼凑在一起。如果语法不正确,我道歉,我刚刚开始使用Python。到目前为止,我有:
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')
price = df.loc[df['Gym'], 'Price']
我应该做一个小组吗?如果是这样,我如何包含Type
== 1 Region
== 4列和240到260的注册?
答案 0 :(得分:0)
您可以创建一个mask
并指定所有条件,然后使用掩码进行子集化:
mask = (df['Region'] == 4) & (df['Type'] == 1) & \
(df['enroll'] <= 260) & (df['enroll'] >= 240) & \
df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')
df['price'][mask]
# Series([], name: price, dtype: int64)
返回空,因为没有记录满足上述所有条件。
答案 1 :(得分:0)
我必须添加一个实际符合您标准的实例,否则您将得到一个空结果。您希望将df.loc
用于以下条件:
In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region Type enroll estimates price Gym
...: 2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
...: 4 2 100 0.26 37 Baseball|Tennis
...: 4 1 247 0.65 61 Basketball|Baseball|Ballet
...: 4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
...: 1 1 286 0.74 78 Swimming|Basketball
...: 0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
...: 0 1 263 0.91 31 Tennis
...: 2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
...: 3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
...: 0 1 109 0.12 17 Football|Hockey|Volleyball""")
In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")
In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)")
...: & (df.enroll <= 260) & (df.enroll >= 240)
...: & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]:
2 61
Name: price, dtype: int64
注意我使用了一个正则表达式模式for contains,它实际上充当了正则表达式的AND运算符。你可以简单地为篮球和棒球做另外的.contains
条件。