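Imputing missing values

Time: 2020-07-06 16:19:25

Tags: regex pandas nan

I'm trying to filter the talent pool dataset dataframe so that the location column keeps only the city and state information (it currently contains strings like "Software Developer in London, United Kingdom"). I tried replacing the NaN values with 0, and confirmed this worked by subsetting the dataframe to return only NaN values (which returned an empty dataframe, as expected). But every time I run the last statement, I get the following error: "ValueError: cannot mask with array containing NA / NaN values". Why is this happening?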

talentpool_subset = talentpool_df[['name', 'profile', 'location','skills']]
talentpool_subset

talentpool_subset['location'].fillna(0, inplace=True)
location = talentpool_subset['location'].isna()
talentpool_subset[location]

talentpool_subset[talentpool_subset['location'].str.contains(r'(?<=in).*')]

    name    profile     url     source  github  location    skills  tags_strong     tags_expert     is_available    description
0   Hugo L. Samayoa     DevOps Developer    https://www.toptal.com/resume/hugo-l-samayoa    toptal  NaN     DevOps Developer in Long Beach, CA, United States   {"Paradigms":["Agile Software Development","Sc...   NaN     ["Linux System Administration","VMware ESXi","...   available   "DevOps before DevOps" is a term mostly associ...
1   Stepan Yakovenko    Software Developer  https://www.toptal.com/resume/stepan-yakovenko  toptal  stiv-yakovenko  Software Developer in Novosibirsk, Novosibirsk...   {"Platforms":["Debian Linux","Windows","Linux"...   ["Linux","C++","AngularJS"]     ["Java","HTML5","CSS","JavaScript","MySQL","Hi...   available   Stepan is an experienced software developer wi...
2   Slobodan Gajic  Software Developer  https://www.toptal.com/resume/slobodan-gajic    toptal  bobangajicsm    Software Developer in Sremska Mitrovica, Vojvo...   {"Platforms":["Firebase","XAMPP"],"Storage":["...   ["Firebase","Karma"]    ["jQuery","HTML5","CSS3","Git","JavaScript","S...   available   Slobodan is a front-end developer with a Bache...
3   Bruno Furtado Montes Oliveira   Visual Studio Team Services (VSTS) Developer    https://www.toptal.com/resume/bruno-furtado-mo...   toptal  brunofurmon     Visual Studio Team Services (VSTS) Developer i...   {"Paradigms":["Agile","CQRS","Azure DevOps"],"...   ["Windows","C#",".NET","SQL","Python","jQuery"...   NaN     available   Since 2013, Bruno has been making a living as ...
4   Jennifer Aquino     Query Optimization Developer    https://www.toptal.com/resume/jennifer-aquino   toptal  BlueCamelArt    Query Optimization Developer in West Ryde, New...   {"Paradigms":["Automation","ETL Implementation...   ["Data Warehouse","Unix","Oracle 10g","Automat...   ["SQL","SQL Server Integration Services (SSIS)...   available   Jennifer has five years of professional experi...

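1 Answer:

Answer 0: (score: 0)

This assumes the goal is to get the location, and that a mask is not required to do so. The code below uses .extract() to keep only the city, state text in the location column.

For example: Long Beach, CA, United States from DevOps Developer in Long Beach, CA, United States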
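To see what the lookbehind pattern from the question captures on a single string, here is a minimal sketch using Python's re module. The "Engineer in London" string and the stricter " in " variant are assumptions added for illustration; they are not part of the original post.

```python
import re

# Sample string taken from the question's 'location' column
text = "DevOps Developer in Long Beach, CA, United States"

# Pattern from the question: everything after the first "in"
match = re.search(r"(?<=in).*", text)
print(repr(match.group(0)))  # repr() makes the leading space visible

# Hypothetical input: "in" also occurs inside the word "Engineer",
# so the lookbehind fires too early
early = re.search(r"(?<=in).*", "Engineer in London")
print(repr(early.group(0)))

# Assumed stricter variant: require " in " as a separate word
strict = re.search(r"(?<= in ).*", text)
print(repr(strict.group(0)))
```

The plain `(?<=in)` lookbehind keeps the space that follows "in" and can match inside a longer word, which is why anchoring on " in " with surrounding spaces may be the safer choice for this kind of data.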

# Import libraries
import pandas as pd
import numpy as np


# Create list using text from question
name = ['Hugo L. Samayoa','Stepan Yakovenko','Slobodan Gajic','Bruno Furtado Montes Oliveira','Jennifer Aquino']
profile = ['DevOps Developer','Software Developer','Software Developer','Visual Studio Team Services (VSTS) Developer','Query Optimization Developer']
url = ['https://www.toptal.com/resume/hugo-l-samayoa','https://www.toptal.com/resume/stepan-yakovenko','https://www.toptal.com/resume/slobodan-gajic','https://www.toptal.com/resume/bruno-furtado-mo...','https://www.toptal.com/resume/jennifer-aquino']
source = ['toptal','toptal','toptal','toptal','toptal']
github = [np.nan, 'stiv-yakovenko','bobangajicsm','brunofurmon','BlueCamelArt']
location = ['DevOps Developer in Long Beach, CA, United States', 'Software Developer in Novosibirsk, Novosibirsk','Software Developer in Sremska Mitrovica, Vojvo','Visual Studio Team Services (VSTS) Developer in New York','Query Optimization Developer in West Ryde, New York']
skills = ['{"Paradigms":["Agile Software Development","Sc...', '{"Platforms":["Debian Linux","Windows","Linux"...','{"Platforms":["Firebase","XAMPP"],"Storage":["...','{"Paradigms":["Agile","CQRS","Azure DevOps"],"...','{"Paradigms":["Automation","ETL Implementation...']

# Create DataFrame using list above
talentpool_df = pd.DataFrame({
    'name':name,
    'profile':profile,
    'url':url,
    'source':source,
    'github':github,
    'location':location,
    'skills':skills
})

# Add NaN row to DataFrame
talentpool_df.loc[6,:] = np.nan

# Subset DataFrame to get columns of interest (.copy() avoids SettingWithCopyWarning)
talentpool_subset = talentpool_df[['name', 'profile', 'location','skills']].copy()

# Use .extract() to keep only the text after ' in ' in the 'location' column
talentpool_subset['location'] = talentpool_subset['location'].str.extract(r'(?<= in )(.*)', expand=False)

Output

talentpool_subset

[image: output of talentpool_subset]
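As for why the original mask raised the error, one likely explanation is that filling NaN with the integer 0 leaves a non-string value in the column, and .str.contains() returns NaN for non-string entries, so the resulting mask is not purely boolean. The sketch below reproduces this on a small stand-in Series (the two-row data and the variable names are assumptions, not the asker's real dataset) and shows the na=False parameter of .str.contains() as one way to get a usable mask.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's 'location' column, with one missing value.
# dtype=object mirrors a typical column of Python strings.
location = pd.Series(
    ["DevOps Developer in Long Beach, CA, United States", np.nan],
    dtype=object,
    name="location",
)

# Filling with the integer 0 removes the NaN but leaves a non-string value
filled = location.fillna(0)

# .str.contains() yields NaN for the non-string entry,
# so this mask still contains a missing value
mask = filled.str.contains(r"(?<=in).*")
print(mask.isna().any())

# na=False maps missing results to False, giving a purely boolean mask
safe_mask = location.str.contains(r"(?<=in).*", na=False)
print(location[safe_mask])
```

With na=False there is no need to fill the column beforehand; rows with a missing location are simply excluded from the filtered result.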