Find a specific string in a column and find the maximum value corresponding to that string

Time: 2016-07-30 18:59:15

Tags: python csv pandas max

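I would like to know:

1.) How to find a specific string in a column
2.) Given that string, how to find its corresponding maximum value
3.) How to count the number of strings in each row of that column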

I have a csv file named sports.csv:

import pandas as pd
import numpy as np

# load the data into a DataFrame
X = pd.read_csv('sports.csv')

The two columns of interest are the Total and Gym columns:

 Total  Gym
40  Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
37  Baseball|Tennis
61  Basketball|Baseball|Ballet
12  Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
78  Swimming|Basketball
29  Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
31  Tennis
54  Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
33  Baseball|Hockey|Swimming|Cycling
17  Football|Hockey|Volleyball

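Note that the Gym column holds multiple strings per row, one for each sport. I'm looking for a way to find all gyms that have Baseball and, among those, the gym with the largest Total. However, I'm only interested in gyms that have at least two other sports, i.e. I don't want to consider: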

  Total   Gym
  37    Baseball|Tennis

3 Answers:

Answer 0 (score: 1)

You can do this easily with pandas.

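First, split each string into a list on the pipe delimiter, then keep only the rows whose list has more than two entries, since the criterion is Baseball plus at least two other sports. Rows with one or two sports end up with an empty Gym string.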

In [4]: df['Gym'] = df['Gym'].str.split('|').apply(lambda x: ' '.join([i for i in x if len(x)>2]))

In [5]: df
Out[5]: 
   Total                                                Gym
0     40  Football Baseball Hockey Running Basketball Sw...
1     37                                                   
2     61                         Basketball Baseball Ballet
3     12  Swimming Ballet Cycling Basketball Volleyball ...
4     78                                                   
5     29  Baseball Tennis Ballet Cycling Basketball Foot...
6     31                                                   
7     54  Tennis Football Ballet Cycling Running Swimmin...
8     33                   Baseball Hockey Swimming Cycling
9     17                         Football Hockey Volleyball

Use str.contains to search the Gym column for the string Baseball.

In [6]: df = df.loc[df['Gym'].str.contains('Baseball')]

In [7]: df
Out[7]: 
   Total                                                Gym
0     40  Football Baseball Hockey Running Basketball Sw...
2     61                         Basketball Baseball Ballet
3     12  Swimming Ballet Cycling Basketball Volleyball ...
5     29  Baseball Tennis Ballet Cycling Basketball Foot...
7     54  Tennis Football Ballet Cycling Running Swimmin...
8     33                   Baseball Hockey Swimming Cycling

Compute the corresponding string count for each row.

In [8]: df['Count'] = df['Gym'].str.split().str.len()

Then select the row of the DataFrame that corresponds to the maximum value in the Total column.

In [9]: df.loc[df['Total'].idxmax()]
Out[9]: 
Total                            61
Gym      Basketball Baseball Ballet
Count                             3
Name: 2, dtype: object
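
For reference, the steps above can be condensed into one self-contained sketch. It is only an illustration under a few assumptions: the sample rows are typed inline instead of being read from sports.csv, the names `sports`, `has_baseball`, `candidates` and `best` are made up here, and the Gym column is left intact by filtering with boolean masks rather than overwriting it.

```python
import pandas as pd

# Inline copy of the sample rows from the question, so this runs without sports.csv.
df = pd.DataFrame({
    "Total": [40, 37, 61, 12, 78, 29, 31, 54, 33, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football",
        "Swimming|Basketball",
        "Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming",
        "Tennis",
        "Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball",
        "Baseball|Hockey|Swimming|Cycling",
        "Football|Hockey|Volleyball",
    ],
})

# Split each Gym string into a list of sports (a single-character pattern is taken literally).
sports = df["Gym"].str.split("|")

# Number of sports per row.
df["Count"] = sports.str.len()

# Exact membership test, so only the whole word "Baseball" matches.
has_baseball = sports.apply(lambda s: "Baseball" in s)

# Baseball plus at least two other sports means more than two entries.
candidates = df[has_baseball & (df["Count"] > 2)]

# Row with the largest Total among the remaining candidates.
best = candidates.loc[candidates["Total"].idxmax()]
print(best)
```

Because nothing is overwritten, `df` still holds the original pipe-separated strings afterwards, which may be convenient if you want to repeat the query for another sport.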

Answer 1 (score: 0)

You can do it in a single pass while reading the file:

import csv
with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # replace the pipe-separated field with one list entry per sport
        row[1:] = row[1].split("|")
        if "Baseball" in row and len(row[1:]) > 2 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # report the count for the best row, not for the last row read
        print(best, mx, len(best[1:]))

Which gives you:

(['61', 'Basketball', 'Baseball', 'Ballet'], 61, 3)

Another approach, without splitting, is to count the pipe characters:

import csv
with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # more than one pipe means at least three sports in the row
        if "Baseball" in row[1] and row[1].count("|") > 1 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # report the pipe count for the best row, not for the last row read
        print(best, mx, best[1].count("|"))

Note, though, that with this approach a substring can match rather than only the exact word.
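
If exact matches matter, a rough sketch along the following lines might help. It is not part of the answer above: the function name `best_gym`, its parameters, and the row containing "Baseballs" are all made up for illustration, and the data is read from an in-memory string instead of sports.csv.

```python
import csv
import io

def best_gym(f, sport="Baseball", min_sports=3):
    """Return (total, sports) for the highest-Total row that lists `sport`
    exactly and has at least `min_sports` entries, or None if no row qualifies."""
    best = None
    reader = csv.reader(f, delimiter=" ", skipinitialspace=True)
    next(reader, None)  # skip the "Total  Gym" header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        total, sports = int(row[0]), row[1].split("|")
        # list membership compares whole names, so "Baseballs" does not match
        if sport in sports and len(sports) >= min_sports:
            if best is None or total > best[0]:
                best = (total, sports)
    return best

# Hypothetical sample: the 90 row is invented to show the substring pitfall.
sample = io.StringIO(
    "Total  Gym\n"
    "37  Baseball|Tennis\n"
    "61  Basketball|Baseball|Ballet\n"
    "90  Softball|Baseballs|Tennis\n"
    "33  Baseball|Hockey|Swimming|Cycling\n"
)
result = best_gym(sample)
print(result)
```

A plain `"Baseball" in row[1]` check would have picked the invented 90 row here, whereas the membership test on the split list skips it.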

Answer 2 (score: 0)

Try this:

df3.loc[df3['Gym'].str.contains('Hockey') & (df3['Gym'].str.count(r'\|') > 1)].sort_values('Total').tail(1)

 Total                                                Gym
0     40  Football|Baseball|Hockey|Running|Basketball|Sw...


df3.loc[df3['Gym'].str.contains('Baseball') & (df3['Gym'].str.count(r'\|') > 1)].sort_values('Total').tail(1)

   Total                         Gym
2     61  Basketball|Baseball|Ballet
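
Since the two calls differ only in the sport name, they could be folded into a small helper. The following is just a sketch: `top_gym` and `min_others` are hypothetical names, the DataFrame is a five-row subset of the question's data typed inline, and `nlargest(1, "Total")` is swapped in for `sort_values("Total").tail(1)` (the two can pick different rows when there is a tie on Total).

```python
import re
import pandas as pd

def top_gym(df, sport, min_others=2):
    """Return the one-row DataFrame with the largest Total among gyms that
    mention `sport` and list at least `min_others` additional sports."""
    # re.escape keeps the sport name literal inside the regex used by str.contains
    has_sport = df["Gym"].str.contains(re.escape(sport))
    # n pipes separate n + 1 sports, so min_others other sports need >= min_others pipes
    enough = df["Gym"].str.count(r"\|") >= min_others
    return df[has_sport & enough].nlargest(1, "Total")

# Subset of the question's rows, typed inline so the sketch is self-contained.
df3 = pd.DataFrame({
    "Total": [40, 37, 61, 33, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Baseball|Hockey|Swimming|Cycling",
        "Football|Hockey|Volleyball",
    ],
})

baseball = top_gym(df3, "Baseball")
hockey = top_gym(df3, "Hockey")
print(baseball)
print(hockey)
```

Like the one-liners above, this still matches substrings; splitting on the pipe and testing list membership would be needed for exact matches.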