I'm having trouble filtering this CSV file.
Here are some entries from the CSV table:
Name Info Bio
Alice Woman: 21y (USA) Actress
Breonna Woman: (France) Singer
Carla Woman: 30y (Trinidad and Tobago) Actress
Diana Woman: (USA) Singer
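I'm trying to filter the "Info" column to get a list of all countries and their frequencies. I'd also like to do the same for age. As you can see, not all of the women list their age.
I tried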
import pandas as pd
women = pd.read_csv('women.csv')
women_count = pd.Series(' '.join(women.Info).split()).value_counts()
However, this splits everything on whitespace and outputs:
Woman: 4
(USA) 2
21y 1
(Trinidad 1
and 1
Tobago) 1
30y 1
I should add that I have already tried women_filtered = women[women['Info'] == '(USA)'], but that didn't work.
My question is:
Thanks
Answer 0: (score: 1)
print(df)
Name Info Bio
0 Alice Woman: 21y (USA) Actress
1 Carla 30y (Trinidad and Tobago) Singer
2 Breonna Woman: (France) Actress
3 Diana Woman: (USA) Singer
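# Solution
# Extract the age and the country name into new columns
# (raw strings avoid invalid-escape warnings; expand=False returns a Series)
df = df.assign(Age=df.Info.str.extract(r'(\d+(?=\D))', expand=False), Countries=df.Info.str.extract(r'\((.*?)\)', expand=False))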
Name Info Bio Age Countries
0 Alice Woman: 21y (USA) Actress 21 USA
1 Carla 30y (Trinidad and Tobago) Singer 30 Trinidad and Tobago
2 Breonna Woman: (France) Actress NaN France
3 Diana Woman: (USA) Singer NaN USA
# Keep only the rows without an age
df[df.Age.isna()]
Name Info Bio Age Countries
2 Breonna Woman: (France) Actress NaN France
3 Diana Woman: (USA) Singer NaN USA
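To get the frequencies the question asks for, the extracted columns can be counted with value_counts. This is a minimal sketch, assuming the same sample data as above; the frame is rebuilt here so the snippet runs on its own, and the names country_counts and age_counts are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the frame printed above
df = pd.DataFrame({
    'Name': ['Alice', 'Carla', 'Breonna', 'Diana'],
    'Info': ['Woman: 21y (USA)', '30y (Trinidad and Tobago)',
             'Woman: (France)', 'Woman: (USA)'],
    'Bio': ['Actress', 'Singer', 'Actress', 'Singer'],
})

# expand=False returns a Series for a single capture group
df = df.assign(
    Age=df.Info.str.extract(r'(\d+(?=\D))', expand=False),
    Countries=df.Info.str.extract(r'\((.*?)\)', expand=False),
)

# Frequency of each country
country_counts = df.Countries.value_counts()

# Frequency of each age; dropna=False also counts the missing ages
age_counts = df.Age.value_counts(dropna=False)

print(country_counts)
print(age_counts)
```

Note that Age holds strings here, since str.extract returns text; convert it with astype(float) if numeric ages are needed.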
Answer 1: (score: 1)
import pandas as pd
df = pd.DataFrame(
{'Name':['Alice', 'Breonna', 'Carla', 'Diana'],
'Info':['Woman: 21y (USA)', 'Woman: (France)', 'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
'Bio':['Actress', 'Singer', 'Actress', 'Singer']}
)
# defining columns using regex
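# raw strings avoid invalid-escape warnings; expand=False returns a Series
df['country'] = df['Info'].str.extract(r'\(([^)]+)\)', expand=False)
# note: \d{2} only matches two-digit ages
df['age'] = df['Info'].str.extract(r'\s+(\d{2})y\s+', expand=False).astype(float)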
df['noage'] = df['age'].isnull().astype(int)
# frequency of countries
sizes = df.groupby('country').size()
sizes
This outputs the frequencies:
country
France 1
Trinidad and Tobago 1
USA 2
dtype: int64
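With the country in its own column, the exact-match filter attempted in the question behaves as intended, and the age frequencies can be computed the same way as the countries. This is a sketch, assuming the sample data above (rebuilt here so it runs standalone); usa and age_sizes are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Breonna', 'Carla', 'Diana'],
     'Info': ['Woman: 21y (USA)', 'Woman: (France)',
              'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
     'Bio': ['Actress', 'Singer', 'Actress', 'Singer']}
)
df['country'] = df['Info'].str.extract(r'\(([^)]+)\)', expand=False)
df['age'] = df['Info'].str.extract(r'\s+(\d{2})y\s+', expand=False).astype(float)

# Exact match on the extracted column, not on the raw Info string
usa = df[df['country'] == 'USA']

# groupby drops NaN keys by default; dropna=False keeps the missing ages
age_sizes = df.groupby('age', dropna=False).size()

print(usa)
print(age_sizes)
```

The original comparison women['Info'] == '(USA)' matched nothing because == tests the whole string, and no Info value consists of '(USA)' alone.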
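I would look up how to write regular expressions so that you can learn to extract information from strings yourself. Pythex.org is a nice site for trying out regular expressions in Python, and it offers some useful hints.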
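As a starting point, patterns similar to the ones used above can be tried directly with Python's built-in re module. This is only a sketch; the parse_info helper and its return format are hypothetical and not part of the answer's code.

```python
import re

# \( and \) match literal parentheses; ([^)]+) captures the text between them
COUNTRY = re.compile(r'\(([^)]+)\)')
# (\d+) captures one or more digits; the lookahead (?=y) requires a trailing 'y'
AGE = re.compile(r'(\d+)(?=y)')

def parse_info(info):
    """Return (age, country) from an Info string; missing parts become None."""
    age = AGE.search(info)
    country = COUNTRY.search(info)
    return (int(age.group(1)) if age else None,
            country.group(1) if country else None)

print(parse_info('Woman: 21y (USA)'))
print(parse_info('Woman: (France)'))
```

Testing each pattern on a single string like this makes it easier to see what a capture group returns before applying it to a whole column with str.extract.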