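I would like to know:
1.) How to find a specific string in a column
2.) Given that string, how to find its corresponding maximum value
3.) How to count the number of strings in each row of that column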
I have a CSV file named sports.csv:
import pandas as pd
import numpy as np
#loading the data into data frame
X = pd.read_csv('sports.csv')
The two columns of interest are Total and Gym:
Total Gym
40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
37 Baseball|Tennis
61 Basketball|Baseball|Ballet
12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
78 Swimming|Basketball
29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
31 Tennis
54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
33 Baseball|Hockey|Swimming|Cycling
17 Football|Hockey|Volleyball
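Note that the Gym column holds multiple strings per row, one for each sport. I am looking for a way to find all gyms that offer Baseball and, among those, the gym with the largest Total. However, I am only interested in gyms that offer at least two other sports, i.e. I do not want to consider: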
Total Gym
37 Baseball|Tennis
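Answer 0: (score: 1)
You can use pandas (df below is the data frame read from sports.csv, called X in the question).
First, split each string into a list on the pipe (|) delimiter, then iterate over the list and keep the sports only when the list has more than 2 entries, since your criterion is Baseball plus two other sports. Rows with one or two sports end up as empty strings.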
In [4]: df['Gym'] = df['Gym'].str.split('|').apply(lambda x: ' '.join([i for i in x if len(x)>2]))
In [5]: df
Out[5]:
Total Gym
0 40 Football Baseball Hockey Running Basketball Sw...
1 37
2 61 Basketball Baseball Ballet
3 12 Swimming Ballet Cycling Basketball Volleyball ...
4 78
5 29 Baseball Tennis Ballet Cycling Basketball Foot...
6 31
7 54 Tennis Football Ballet Cycling Running Swimmin...
8 33 Baseball Hockey Swimming Cycling
9 17 Football Hockey Volleyball
Use str.contains to search the Gym column for the string Baseball.
In [6]: df = df.loc[df['Gym'].str.contains('Baseball')]
In [7]: df
Out[7]:
Total Gym
0 40 Football Baseball Hockey Running Basketball Sw...
2 61 Basketball Baseball Ballet
3 12 Swimming Ballet Cycling Basketball Volleyball ...
5 29 Baseball Tennis Ballet Cycling Basketball Foot...
7 54 Tennis Football Ballet Cycling Running Swimmin...
8 33 Baseball Hockey Swimming Cycling
Count the number of strings in each remaining row.
In [8]: df['Count'] = df['Gym'].str.split().apply(lambda x: len([i for i in x]))
Then select the row of the data frame that corresponds to the maximum value in the Total column.
In [9]: df.loc[df['Total'].idxmax()]
Out[9]:
Total 61
Gym Basketball Baseball Ballet
Count 3
Name: 2, dtype: object
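The same steps can also be sketched without overwriting the Gym column. This is only an illustration, not the answerer's code: the data frame is rebuilt inline from the rows shown in the question instead of being read from sports.csv, and the names sports, has_baseball and candidates are made up for the example.

```python
import pandas as pd

# Rows copied from the question; in practice this would come from sports.csv.
df = pd.DataFrame({
    "Total": [40, 37, 61, 12, 78, 29, 31, 54, 33, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football",
        "Swimming|Basketball",
        "Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming",
        "Tennis",
        "Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball",
        "Baseball|Hockey|Swimming|Cycling",
        "Football|Hockey|Volleyball",
    ],
})

# One list of sport names per row; the original Gym strings stay untouched.
sports = df["Gym"].str.split("|")

# Number of sports in each row.
df["Count"] = sports.str.len()

# Exact membership test on the list, so only the whole name "Baseball" matches.
has_baseball = sports.apply(lambda names: "Baseball" in names)

# Baseball plus at least two other sports means at least three entries.
candidates = df[has_baseball & (df["Count"] >= 3)]

# Row with the largest Total among the candidates.
best = candidates.loc[candidates["Total"].idxmax()]
print(best)
```

Building a separate boolean mask and filtering once also avoids assigning a new column to a filtered copy of the frame, which pandas may warn about.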
Answer 1: (score: 0)
You can do it in a single pass while reading the file:
import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # Replace the pipe-separated field with one list entry per sport.
        row[1:] = row[1].split("|")
        if "Baseball" in row and len(row[1:]) > 2 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # Report the count for the best row, not for the last row read.
        print(best, mx, len(best[1:]))
Which would give you:
(['61', 'Basketball', 'Baseball', 'Ballet'], 61, 3)
Another approach, without splitting, is to count the pipe characters:
import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # More than one pipe means at least three sports in the field.
        if "Baseball" in row[1] and row[1].count("|") > 1 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # Number of pipes in the best row (sports = pipes + 1).
        print(best, mx, best[1].count("|"))
Be aware, though, that this version can match a substring rather than the exact word.
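To make the substring caveat concrete, here is a small sketch. The gym entry "Beach Volleyball" and the helper has_sport are hypothetical, introduced only for this illustration; they do not appear in the question's data.

```python
def has_sport(gym, sport):
    """Return True only if `sport` is one whole pipe-separated entry of `gym`."""
    return sport in gym.split("|")


# Hypothetical row: this gym offers "Beach Volleyball" but not plain "Volleyball".
gym = "Beach Volleyball|Tennis|Ballet"

# A substring test on the raw string reports a match anyway.
substring_match = "Volleyball" in gym

# Comparing against the split entries only matches whole names.
exact_match = has_sport(gym, "Volleyball")

print(substring_match, exact_match)
```

With the sports listed in the question the two tests agree for Baseball, so the difference only matters if one sport name can appear inside another.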
Answer 2: (score: 0)
Try this:
df3.loc[(df3['Gym'].str.contains('Hockey') == True) & (df3["Gym"].str.count(r"\|")>1)].sort_values("Total").tail(1)
Total Gym
0 40 Football|Baseball|Hockey|Running|Basketball|Sw...
df3.loc[(df3['Gym'].str.contains('Baseball') == True) & (df3["Gym"].str.count(r"\|")>1)].sort_values("Total").tail(1)
Total Gym
2 61 Basketball|Baseball|Ballet
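If exact matching and a per-row count are both needed, one possible variant (not part of this answer) is to expand the Gym column into indicator columns with str.get_dummies. The sketch below uses only five of the question's rows to stay short, and the names flags, n_sports and result are chosen for the example.

```python
import pandas as pd

# A subset of the rows from the question, built inline for illustration.
df3 = pd.DataFrame({
    "Total": [40, 37, 61, 78, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Basketball",
        "Football|Hockey|Volleyball",
    ],
})

# One indicator column per sport name, split on the pipe character.
flags = df3["Gym"].str.get_dummies(sep="|")

# Row sums give the number of sports offered by each gym.
n_sports = flags.sum(axis=1)

# Baseball must be present, together with at least two other sports.
mask = flags["Baseball"].astype(bool) & (n_sports >= 3)

# Keep the single row with the largest Total.
result = df3[mask].nlargest(1, "Total")
print(result)
```

Because each indicator column corresponds to a whole sport name, a search for Baseball here cannot be triggered by a longer name that merely contains it.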