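I'm trying to create several columns in a pandas DataFrame at once, where each column name is a key in a dictionary and the function returns 1 if any of the values associated with that key is present.
My DataFrame has three columns: jp_ref, jp_title, and jp_description. Essentially, I'm searching jp_description for the terms assigned to each key, and filling the column for that key with 1 or 0 depending on whether any of those terms appears in jp_description.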
import pandas as pd

jp_title = ['software developer', 'operations analyst', 'it project manager']
jp_ref = ['j01', 'j02', 'j03']
jp_description = ['software developer with java and sql experience',
                  'operations analyst with ms in operations research, statistics or related field. sql experience desired.',
                  'it project manager with javascript working knowledge']
myDict = {'jp_title': jp_title, 'jp_ref': jp_ref, 'jp_description': jp_description}
data = pd.DataFrame(myDict)

technologies = {'java': ['java', 'jdbc', 'jms', 'jconsole', 'jprobe', 'jax', 'jax-rs', 'kotlin', 'jdk'],
                'javascript': ['javascript', 'js', 'node', 'node.js', 'mustache.js', 'handlebar.js', 'express', 'angular',
                               'angular.js', 'react.js', 'angularjs', 'jquery', 'backbone.js', 'd3'],
                'sql': ['sql', 'mysql', 'sqlite', 't-sql', 'postgre', 'postgresql', 'db', 'etl']}

def term_search(doc,tech):
    for term in technologies[tech]:
        if term in doc:
            return 1
        else:
            return 0

for tech in technologies:
    data[tech] = data.apply(term_search(data['jp_description'],tech))

I get the following error, which I don't understand:
TypeError: ("'int' object is not callable", 'occurred at index jp_ref')
Answer 0 (score: 1)
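Your logic is wrong: you loop over the list, but the function returns 0 or 1 on the first iteration, so the value of jp_description is never compared against the full list.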
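To make the early return concrete, here is a minimal sketch of the question's function under a different name. The trimmed dictionary and the 'jdbc' description are hypothetical inputs chosen for illustration. The second half suggests where the TypeError likely comes from: term_search(...) is called immediately on the whole column, so what reaches apply is an integer rather than a function.

```python
import pandas as pd

# Trimmed copy of the question's dictionary, for illustration only.
technologies = {'java': ['java', 'jdbc', 'jms']}

def term_search_original(doc, tech):
    # Same logic as in the question: both branches return on the first term.
    for term in technologies[tech]:
        if term in doc:
            return 1
        else:
            return 0

# Hypothetical description: 'jdbc' is in the list, but only 'java'
# (the first term) is ever checked, so the result is 0.
early_exit = term_search_original('backend developer with jdbc experience', 'java')
print(early_exit)

# Passing the whole column runs the function right away. `in` on a Series
# tests the index labels, not the values, so this also yields a plain int.
column = pd.Series(['software developer with java and sql experience'])
eager_result = term_search_original(column, 'java')
print(type(eager_result), callable(eager_result))
```

Because eager_result is an int, handing it to apply is the same as asking pandas to call an integer, which matches the "'int' object is not callable" message.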
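Instead, you can split jp_description into words and check for common elements with the technologies dictionary: if any element is shared, a term was found, so return 1; otherwise return 0.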
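
def term_search(doc, tech):
    # Split the description into words and intersect them with the term list.
    words = set(doc.split(" "))
    common_elem = words.intersection(technologies[tech])
    # Any shared element means at least one term was found.
    if len(common_elem) > 0:
        return 1
    return 0

for tech in technologies:
    data[tech] = data['jp_description'].apply(lambda x: term_search(x, tech))
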
             jp_title jp_ref          jp_description  java  javascript  sql
0  software developer    j01  software developer....     1           0    1
1  operations analyst    j02  operations analyst ..     0           0    1
2  it project manager    j03  it project manager...     0           1    0
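
One limitation of splitting on spaces is that a term followed by punctuation (for example 'sql.' or 'java,') would not match. A possible alternative, sketched below and not part of the original answer, is a regular expression that requires each term to stand alone, so 'java' is still not counted inside 'javascript'. The shortened term lists are an assumption made to keep the example small.

```python
import re
import pandas as pd

data = pd.DataFrame({
    'jp_ref': ['j01', 'j02', 'j03'],
    'jp_description': [
        'software developer with java and sql experience',
        'operations analyst with ms in operations research, statistics or related field. sql experience desired.',
        'it project manager with javascript working knowledge',
    ],
})

# Shortened term lists (assumption, for brevity).
technologies = {
    'java': ['java', 'jdbc', 'jdk'],
    'javascript': ['javascript', 'js', 'node.js'],
    'sql': ['sql', 'mysql', 'postgresql'],
}

for tech, terms in technologies.items():
    # Escape each term so characters such as '.' are matched literally.
    alternatives = '|'.join(re.escape(term) for term in terms)
    # The lookarounds reject matches flanked by word characters,
    # so 'java' does not match inside 'javascript'.
    pattern = rf'(?<!\w)(?:{alternatives})(?!\w)'
    data[tech] = data['jp_description'].str.contains(pattern, regex=True).astype(int)

print(data[['jp_ref', 'java', 'javascript', 'sql']])
```

This works on the whole column at once through Series.str.contains, so no per-row apply is needed. On the three sample rows it gives the same 0/1 values as the table above.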