How to extract specific values from a column in a Python / Kaggle dataset

Time: 2018-04-06 03:57:33

Tags: python regex pandas kaggle

I am trying to extract the following fields from the "jobpost" column:

  1. Job Title
  2. Position Duration
  3. Job Responsibilities
  4. Required Qualifications
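in the dataset at https://www.kaggle.com/madhab/jobposts/data.

I have tried slicing and regular expressions, but I still cannot get the values I want.

I tried to extract the fields from a single entry of the jobpost column with a regular expression, but still could not get the result: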

 gh = df2[2] 
 pattern = re.compile(r'JOB TITLE:.*\S|POSITION DURATION: .*\S|POSITION LOCATION: .*\S|JOB DESCRIPTION: .*\S |JOB RESPONSIBILITIES: .*\S|REQUIRED QUALIFICATIONS: .*\S|REMUNERATION: .*\S')

Output:

<_sre.SRE_Match object; span=(43, 74), match='JOB TITLE:  Country Coordinator'>
<_sre.SRE_Match object; span=(76, 122), match='POSITION DURATION:   Renewable annual contract'>
<_sre.SRE_Match object; span=(124, 159), match='POSITION LOCATION: Yerevan, Armenia'>
<_sre.SRE_Match object; span=(161, 219), match='JOB DESCRIPTION:   Public outreach and strengthen>
<_sre.SRE_Match object; span=(1141, 1192), match='REMUNERATION:  Salary commensurate with experienc>

As you can see, I cannot extract the "JOB RESPONSIBILITIES:" and "REQUIRED QUALIFICATIONS:" values. I even tried the following, with no result:

  aa= "JOB RESPONSIBILITIES:  \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS:  \r\n- Degree in environmentally related field, or 5 years relevant\r\nexperience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n- Knowledge/ experience of working with environmental issues specific to\r\nArmenia is a plus."

pattern= re.compile(r"(?s)JOB RESPONSIBILITIES: .*")
print(pattern.match(gh).group()) 

Output:

AttributeError: 'NoneType' object has no attribute 'group'

So how can I solve this, and what approach should I use to get the values I want? I am still new to this. Thanks in advance.

1 Answer:

Answer 0: (score: 0)

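If the text in the jobpost column is structured consistently (I only looked at the online preview of the data), you can use a later header as a stop word that anchors the match, i.e. match all text from one header until you reach the next one: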

(?s)JOB TITLE:.*(?=POSITION LOCATION)
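A minimal sketch of this anchor approach in Python, applied to the two fields the question could not extract. The sample text is a shortened, illustrative version of the entry quoted in the question, and the variable names are assumptions. Two details differ from the attempt in the question: `re.search` is used because `re.match` only matches at the very start of the string, and a lazy `.*?` is used so the match stops at the first following header rather than the last one.

```python
import re

# Shortened sample in the same layout as the question's text (illustrative).
text = (
    "JOB TITLE:  Country Coordinator\r\n"
    "POSITION DURATION:   Renewable annual contract\r\n"
    "POSITION LOCATION: Yerevan, Armenia\r\n"
    "JOB RESPONSIBILITIES:  \r\n"
    "- Helping to organize and prepare CENN seminars/ workshops;\r\n"
    "REQUIRED QUALIFICATIONS:  \r\n"
    "- Oral and written fluency in Armenian, Russian and English;\r\n"
    "REMUNERATION:  Salary commensurate with experience.\r\n"
)

# (?s) lets "." match newlines; the lookahead stops just before the next header.
pattern = re.compile(r"(?s)JOB RESPONSIBILITIES:(.*?)(?=REQUIRED QUALIFICATIONS:)")

found = pattern.search(text)  # search scans the whole string, unlike match
responsibilities = found.group(1).strip() if found else None
print(responsibilities)
```

Checking `found` before calling `.group()` avoids the `AttributeError` from the question when a post lacks the header.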

If the order of the stop words varies, you can use a negative lookahead with alternation, e.g.

(?s)JOB TITLE:((?!POSITION LOCATION|JOB RESPONSIBILITIES|REQUIRED QUALIFICATIONS).)*
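The negative lookahead idea can be generalized to pull out every field in one pass, whatever order the headers appear in. The following is only a sketch: the `HEADERS` list is taken from the pattern in the question, and `extract_fields` is a hypothetical helper name, not part of any library. It assumes the header words never occur inside the body text of a field.

```python
import re

# Header names taken from the question's pattern (assumed to be complete).
HEADERS = [
    "JOB TITLE", "POSITION DURATION", "POSITION LOCATION",
    "JOB DESCRIPTION", "JOB RESPONSIBILITIES",
    "REQUIRED QUALIFICATIONS", "REMUNERATION",
]

def extract_fields(post):
    """Return {header: text} for each header found in post, in any order."""
    stop = "|".join(re.escape(h) + ":" for h in HEADERS)
    # Capture a header, then consume characters one at a time for as long as
    # no header starts at the current position (negative lookahead).
    pattern = re.compile(rf"(?s)({stop})((?:(?!{stop}).)*)")
    return {
        m.group(1).rstrip(":"): m.group(2).strip()
        for m in pattern.finditer(post)
    }

# Illustrative post with the headers in a different order than usual.
post = (
    "JOB TITLE:  Country Coordinator\r\n"
    "REQUIRED QUALIFICATIONS:  \r\n"
    "- Degree in environmentally related field\r\n"
    "JOB RESPONSIBILITIES:  \r\n"
    "- Helping to organize and prepare CENN seminars/ workshops;\r\n"
)
fields = extract_fields(post)
print(fields["JOB TITLE"])
```

Assuming `df` is a pandas DataFrame loaded from the Kaggle CSV, something like `df["jobpost"].apply(extract_fields)` would then give one dictionary per row.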
