Question

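For context, I'm looking at a dataset of data scientist job postings and their descriptions, and I'm trying to determine how often each degree level is mentioned in those descriptions.

I got the code working on one specific job description, but now I need a for loop or an equivalent to iterate over the description column and accumulate the number of times each education level is mentioned.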

sentence = set(data_scientist_filtered.description.iloc[30].split())
degree_level = {'level_1':{'bachelors','bachelor','ba'},
    'level_2':{'masters','ms','m.s',"master's",'master of science'},
    'level_3':{'phd','p.h.d'}}
results = {}
for key, words in degree_level.items():
    results[key] = len(words.intersection(sentence))
results

An example string looks like this: data_scientist_filtered.description.iloc[30] =

 'the team: the data science team is a newly formed applied research team within s&amp;p global ratings that will be responsible for building and executing a bold vision around using machine learning, natural language processing, data science, knowledge engineering, and human computer interfaces for augmenting various business processes.\n\nthe impact: this role will have a significant impact on the success of our data science projects ranging from choosing which projects should be undertaken, to delivering highest quality solution, ultimately enabling our business processes and products with ai and data science solutions.\n\nwhat’s in it for you: this is a high visibility team with an opportunity to make a very meaningful impact on the future direction of the company. you will work with senior leaders in the organization to help define, build, and transform our business. you will work closely with other senior scientists to create state of the art augmented intelligence, data science and machine learning solutions.\n\nresponsibilities: as a data scientist you will be responsible for building ai and data science models. you will need to rapidly prototype various algorithmic implementations and test their efficacy using appropriate experimental design and hypothesis validation.\n\nbasic qualifications: bs in computer science, computational linguistics, artificial intelligence, statistics, or related field with 5+ years of relevant industry experience.\n\npreferred qualifications:\nms in computer science, statistics, computational linguistics, artificial intelligence or related field with 3+ years of relevant industry experience.\nexperience with financial data sets, or s&amp;p’s credit ratings process is highly preferred.

Example dataframe:

position        company         description             location
data scientist  Xpert Staffing  this job is for..       Atlanta, GA
data scientist  Cotiviti        great opportunity of..  Atlanta, GA

Answer 1

I'd suggest using the isin() method here and then summing the result.

import pandas as pd

# toy frame: each description is exactly one keyword
data = [['John', "ba"], ['Harry', "ms"], ['Bill', "phd"], ['Mary', 'bachelors']]
df = pd.DataFrame(data, columns=['name', 'description'])

degree_level = {
    'level_1': {'bachelors', 'bachelor', 'ba'},
    'level_2': {'masters', 'ms', 'm.s', "master's", 'master of science'},
    'level_3': {'phd', 'p.h.d'}
}

results = {}
# .items() is needed to unpack (key, value) pairs from the dict
for level, values in degree_level.items():
    results[level] = int(df['description'].isin(values).sum())

print(results)
# {'level_1': 2, 'level_2': 1, 'level_3': 1}

Edit: FYI, the for loop can be replaced by a comprehension.

def num_of_degrees(degrees):
    # count the rows whose description equals one of the given keywords
    return df['description'].isin(degrees).sum()

results = {level: num_of_degrees(values) for level, values in degree_level.items()}
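
One caveat worth checking before running this on real job descriptions: isin() compares each cell's entire value against the set, so a row is counted only when the whole description equals one of the keywords. A small sketch with made-up rows (the names toy, long_text and keywords are illustrative only):

```python
import pandas as pd

# Made-up data: one Series of bare keywords, one of longer descriptions.
toy = pd.Series(["ba", "ms", "phd", "bachelors"])
long_text = pd.Series(["ms in computer science preferred", "phd required"])
keywords = {"masters", "ms"}

# isin() is an exact, whole-value membership test.
exact_toy = int(toy.isin(keywords).sum())
exact_long = int(long_text.isin(keywords).sum())

print(exact_toy, exact_long)  # 1 0
```

The second count is zero even though the first long description mentions "ms", which is why a substring or token based test is needed for free text.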

Edit 2

Now that you've shown what the df looks like, I see the problem. You need to filter the df and then count the rows that remain.

# just cleaning some unnecessary values from degree_level
degree_level = {
    'level_1': {'bachelor', ' ba '},
    'level_2': {'masters', ' ms ', ' m.s ', "master's"},
    'level_3': {'phd', 'p.h.d'}}

results = {}

for level, values in degree_level.items():
    # one str.contains() test per keyword, joined with "or"; !r quotes each keyword
    expr = ' or '.join(f"description.str.contains({value!r}, case=False, regex=False)" for value in values)
    # len() gives the number of matching rows
    results[level] = len(df.query(expr, engine='python'))

Something like this should work.
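
For reference, a minimal sketch of the same filter-then-count idea written with a boolean mask instead of query(), assuming a DataFrame with a text column named description. The three sample rows and the helper name count_degree_mentions are made up for illustration. It builds one regular expression per level, guarded so that a short keyword such as "ms" does not match inside a longer word like "systems", and counts the rows that mention each level at least once.

```python
import re
import pandas as pd

# Hypothetical sample rows standing in for data_scientist_filtered.
df = pd.DataFrame({
    "description": [
        "ba in computer science required. ms in statistics preferred.",
        "phd or master's degree in a quantitative field; systems experience.",
        "bachelor degree and 3+ years of experience.",
    ]
})

degree_level = {
    "level_1": {"bachelors", "bachelor", "ba"},
    "level_2": {"masters", "ms", "m.s", "master's", "master of science"},
    "level_3": {"phd", "p.h.d"},
}

def count_degree_mentions(descriptions, levels):
    """Return {level: number of rows mentioning any keyword of that level}."""
    counts = {}
    for level, words in levels.items():
        # Escape keywords (dots are literal) and try longer ones first.
        alternatives = "|".join(
            re.escape(w) for w in sorted(words, key=len, reverse=True)
        )
        # Lookarounds require a non-word character (or text edge) on both sides.
        pattern = rf"(?<!\w)(?:{alternatives})(?!\w)"
        mask = descriptions.str.contains(pattern, case=False, regex=True, na=False)
        counts[level] = int(mask.sum())
    return counts

results = count_degree_mentions(df["description"], degree_level)
print(results)  # {'level_1': 2, 'level_2': 2, 'level_3': 1}
```

This counts rows, not occurrences: a description that mentions both "ms" and "masters" still adds one to level_2. If every occurrence should count, str.count() with the same pattern is one option.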

Answer 2

A simple way to break the text up is to compare n-grams of the text, column by column.
First, create a list of possible values for position, company, and location.
Then compare each list against the matching column and save the results in a data frame, which can be combined at the end.

from nltk.util import ngrams  # assuming ngrams here is NLTK's helper

text1 = "Growing company located in the Atlanta, GA area is currently looking to add a Data Scientist to their team. The Data Scientist will analyze business level data to produce actionable insights utilizing analytics tools"

text2 = "Data scientist data analyst"

n = 2  # bigrams

# ngrams() returns a one-shot iterator, so materialize it before repeated lookups
bigrams1 = list(ngrams(text1.lower().split(), n))  # For description
bigrams2 = list(ngrams(text2.lower().split(), n))  # For position dictionary

def compare(bigrams1, bigrams2):
    # collect every n-gram of bigrams2 that also occurs in bigrams1
    common = []
    for grams in bigrams2:
        if grams in bigrams1:
            common.append(grams)
    return common

compare(bigrams1, bigrams2)

Output:
compare(bigrams1, bigrams2)
Out[140]: [('data', 'scientist')]
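
A self-contained sketch of the same comparison without any third-party library, assuming plain whitespace tokenization is good enough. The helper names word_ngrams and common_ngrams are hypothetical, and the description text is shortened. Storing the description n-grams in a set keeps the membership test cheap and reusable across many rows.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) of a lowercased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def common_ngrams(description, vocabulary, n=2):
    """Return the vocabulary n-grams that also occur in the description."""
    description_grams = set(word_ngrams(description, n))
    found = []
    for gram in word_ngrams(vocabulary, n):
        # keep first-seen order and skip duplicates
        if gram in description_grams and gram not in found:
            found.append(gram)
    return found

text1 = ("Growing company is currently looking to add a Data Scientist "
         "to their team. The Data Scientist will analyze business level data")
text2 = "Data scientist data analyst"

matches = common_ngrams(text1, text2)
print(matches)  # [('data', 'scientist')]
```

To run it over a whole column, something like df['description'].apply(lambda d: common_ngrams(d, text2)) would return one list of matches per row.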

How to use an apply function / for loop on the contents of a dataframe column in Python
