Filtering a CSV with pandas

Time: 2020-10-20 01:02:04

Tags: python pandas csv

I am having trouble filtering this CSV file.

Here are some entries from the CSV table:

Name      Info                                 Bio
Alice     Woman: 21y (USA)                     Actress
Breonna   Woman: (France)                      Singer
Carla     Woman: 30y (Trinidad and Tobago)     Actress
Diana     Woman: (USA)                         Singer

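I am trying to filter the "Info" column to get a list of all countries and how often each one occurs. I would also like to do the same for age. As you can see, not all of the women list their age.

I tried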

women= pd.read_csv('women.csv')
women_count = pd.Series(' '.join(women.Info).split()).value_counts()

However, this splits on every space and outputs:

Woman:     4
(USA)      2
21y        1
(Trinidad  1
and        1
Tobago)    1
30y        1

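I should add that I have already tried women_filtered = women[women['Info'] == '(USA)'], but that did not work.

My questions are:

  1. How do I split the string so that I can filter by country only, given that every country is enclosed in parentheses?
  2. How do I filter the entries that have no age?

Thanks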

2 answers:

Answer 0: (score: 1)

print(df)

      Name                       Info      Bio
0    Alice           Woman: 21y (USA)  Actress
1    Carla  30y (Trinidad and Tobago)   Singer
2  Breonna            Woman: (France)  Actress
3    Diana               Woman: (USA)   Singer

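# Solution

# Extract the age and the country name into new columns.
# Raw strings keep the regex escapes intact; expand=False returns a Series.
df = df.assign(
    Age=df.Info.str.extract(r'(\d+(?=\D))', expand=False),
    Countries=df.Info.str.extract(r'\((.*?)\)', expand=False),
)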

      Name                       Info      Bio  Age            Countries
0    Alice           Woman: 21y (USA)  Actress   21                  USA
1    Carla  30y (Trinidad and Tobago)   Singer   30  Trinidad and Tobago
2  Breonna            Woman: (France)  Actress  NaN               France
3    Diana               Woman: (USA)   Singer  NaN                  USA

# Filter the rows that have no age
df[df.Age.isna()]

     Name             Info      Bio  Age  Countries
2  Breonna  Woman: (France)  Actress  NaN    France
3    Diana     Woman: (USA)   Singer  NaN       USA
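
Once the country is in its own column, counting and filtering become ordinary column operations. The attempt in the question, women['Info'] == '(USA)', returns nothing because == compares the whole cell, and no cell consists of '(USA)' alone. The following is a minimal sketch rather than part of the original answer: it rebuilds the sample frame printed above, and the names country_counts and usa_rows are only illustrative.

```python
import pandas as pd

# Sample data copied from the DataFrame printed at the top of this answer.
df = pd.DataFrame({
    "Name": ["Alice", "Carla", "Breonna", "Diana"],
    "Info": ["Woman: 21y (USA)", "30y (Trinidad and Tobago)",
             "Woman: (France)", "Woman: (USA)"],
    "Bio": ["Actress", "Singer", "Actress", "Singer"],
})

# Same extraction as above; expand=False yields one Series per pattern.
df = df.assign(
    Age=df["Info"].str.extract(r"(\d+(?=\D))", expand=False),
    Countries=df["Info"].str.extract(r"\((.*?)\)", expand=False),
)

# Frequency of each country (illustrative variable name).
country_counts = df["Countries"].value_counts()

# Rows for one country: compare against the extracted column,
# not against the full Info string.
usa_rows = df[df["Countries"] == "USA"]

print(country_counts)
print(usa_rows)
```

If you would rather not add a column, df[df["Info"].str.contains("(USA)", regex=False)] should select the same rows by substring match.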

Answer 1: (score: 1)

import pandas as pd

df = pd.DataFrame(
{'Name':['Alice', 'Breonna', 'Carla', 'Diana'],
 'Info':['Woman: 21y (USA)', 'Woman: (France)', 'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
 'Bio':['Actress', 'Singer', 'Actress', 'Singer']}
)

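# define the new columns with regular expressions (raw strings keep the escapes intact)
df['country'] = df['Info'].str.extract(r'\(([^)]+)\)', expand=False)
df['age'] = df['Info'].str.extract(r'\s+(\d{2})y\s+', expand=False).astype(float)
df['noage'] = df['age'].isnull().astype(int)  # 1 where no age was given, else 0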

# frequency of countries
sizes = df.groupby('country').size()
sizes

This outputs the frequency of each country:

country
France                 1
Trinidad and Tobago    1
USA                    2
dtype: int64
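
The same idea covers the age part of the question. This is a sketch under a couple of assumptions that are not in the answer above: it uses a looser pattern, (\d+)y, because the two-digit pattern only matches ages that have whitespace on both sides, and the names age_counts, no_age and with_age are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Breonna", "Carla", "Diana"],
    "Info": ["Woman: 21y (USA)", "Woman: (France)",
             "Woman: 30y (Trinidad and Tobago)", "Woman: (USA)"],
    "Bio": ["Actress", "Singer", "Actress", "Singer"],
})

# Assumed pattern: any run of digits immediately followed by "y".
# Rows without a match become NaN, which astype(float) keeps as NaN.
df["age"] = df["Info"].str.extract(r"(\d+)y", expand=False).astype(float)

# Frequency of each age; value_counts() leaves NaN out by default.
age_counts = df["age"].value_counts()

# Entries with and without a stated age.
no_age = df[df["age"].isna()]
with_age = df[df["age"].notna()]

print(age_counts)
print(no_age[["Name", "Info"]])
```

Pass dropna=False to value_counts() if you also want to see how many entries have no age.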

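I would suggest reading up on how to write regular expressions so that you can learn to extract information from strings yourself. Pythex.org is a good site for trying out regular expressions in Python, and it also gives some useful hints.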
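
If you want to experiment outside pandas first, the standard re module accepts the same kind of pattern. This is a small sketch on a single string; the age pattern (\d+)y is an assumption about how ages are written, and the variable names are only illustrative.

```python
import re

info = "Woman: 30y (Trinidad and Tobago)"

# \( and \) match literal parentheses; ([^)]+) captures everything between them.
country_match = re.search(r"\(([^)]+)\)", info)

# (\d+)y captures a run of digits that is followed by the letter "y".
age_match = re.search(r"(\d+)y", info)

country = country_match.group(1) if country_match else None
age = int(age_match.group(1)) if age_match else None

# A string without an age yields no match, so re.search returns None.
missing_age = re.search(r"(\d+)y", "Woman: (France)")

print(country, age, missing_age)
```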