How can I get the frequency of the strings in a given list within another, larger list of strings in R?

Asked: 2018-07-04 06:24:04

Tags: python r

I have the following code in Python:

import re

# most popular language list

programing_language_list = ['python', 'java', 'c++', 'php', 'javascript', 'objective-c', 'ruby', 'perl','c','c#', 'sql','kotlin']

# get our Minimum Qualifications column and convert all of the values to a list

minimum_qualifications = df_job_skills['Minimum Qualifications'].tolist()

# let's join our list to a single string and lower case the letter

miniumum_qualifications_string = "".join(str(v) for v in minimum_qualifications).lower()

# find out which language occurs in most in minimum Qualifications string

wordcount = dict((x,0) for x in programing_language_list)
for w in re.findall(r"[\w'+#-]+|[.!?;’]", miniumum_qualifications_string):
    if w in wordcount:
        wordcount[w] += 1
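
For readers without the Kaggle dataset at hand, the counting step above can be sketched in a self-contained form. The three sample strings are made up and stand in for the `Minimum Qualifications` column. Joining with a space instead of `""` is also an assumption that is not in the code above; it keeps the last word of one entry from fusing with the first word of the next.

```python
import re

# Hypothetical stand-in for df_job_skills['Minimum Qualifications'].tolist()
minimum_qualifications = [
    "Experience with Python, Java or C++",
    "Python and SQL",
    "C# or Objective-C",
]

programing_language_list = ['python', 'java', 'c++', 'php', 'javascript',
                            'objective-c', 'ruby', 'perl', 'c', 'c#', 'sql', 'kotlin']

# Assumption: join with a space (the code above joins with ""), so that
# neighbouring entries stay separate tokens.
text = " ".join(str(v) for v in minimum_qualifications).lower()

# Start every language at zero, then count only tokens that are in the list
wordcount = {lang: 0 for lang in programing_language_list}
for w in re.findall(r"[\w'+#-]+|[.!?;’]", text):
    if w in wordcount:
        wordcount[w] += 1

print(wordcount)
```

Because `+`, `#` and `-` are part of the token character class, `c++`, `c#` and `objective-c` survive as single tokens and are not counted as `c`.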

Now I want to do the same thing in R:

# most popular language list 

programing_language_list = list('python', 'java', 'c++', 'php', 'javascript', 'objective-c', 'ruby', 'perl','c','c#', 'sql','kotlin')
#match(c('python',),programing_language_list)

# get our Minimum Qualifications column and convert all of the values to a list

minimum_qualifications = list(dataset[,6])

# let's join our list to a single string and lower case the letter

miniumum_qualifications_string = sapply(paste(unlist(minimum_qualifications),sep=', ',collapse = ""),tolower)

#install.packages("stringr")

library(stringr)

# find out which language occurs in most in minimum Qualifications string


res_min = regmatches(miniumum_qualifications_string,gregexpr("[\\w'+#-]+|[.!?;']",miniumum_qualifications_string,perl = TRUE))

Since R has no dict, I tried to work around it like this:

k=0
for( w in res_min)
{
  for(i in programing_language_list)
  {

      if(i == w) 
      {
        j[k]=i
        print(j[k])
        k=k+1
      }
  }
} 

But it produces output like this:

Warning messages:

1: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
2: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
3: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
4: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
5: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
6: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
7: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
8: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
9: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
10: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
11: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used
12: In if (i == w) { ... :
  the condition has length > 1 and only the first element will be used

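My goal is to find the frequency of the strings from programing_language_list in res_min.

I would like to end up with something like Python's dict data structure: a 12×2 matrix whose first column contains the language names ("python", "c++" and so on) and whose second column contains the number of times the same string occurs in the list res_min.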
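
For reference, the target structure described above can be sketched in Python as a list of two-element rows built from such a dict. The counts below are hypothetical and only four languages are shown to keep the sketch short.

```python
# Hypothetical counts, shaped like the wordcount dict in the Python code above
wordcount = {'python': 3, 'java': 2, 'c++': 1, 'php': 0}

# One (language, count) row per dict entry, highest count first
rows = sorted(wordcount.items(), key=lambda row: row[1], reverse=True)

for language, count in rows:
    print(f"{language:<12}{count}")
```

Each row has two columns, so a dict with all twelve languages would give the 12×2 shape asked for.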

Thanks in advance for your help.

Here is the dataset URL:

https://www.kaggle.com/niyamatalmass/google-job-skills

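2 Answers:

Answer 0 (score: 0)

Your problem appears to be a mistake in how miniumum_qualifications_string is generated.

Using sep = ", ", collapse = "" effectively does nothing here: sep only goes between the separate arguments passed to paste, and you pass a single vector, while collapse = "" joins its elements with no delimiter at all. You only need collapse = ","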
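
The same pitfall can be reproduced with the Python code from the question, which also joins with an empty string. The following is a minimal sketch with a made-up five-element sample; it only illustrates why the delimiter matters for the tokenizing regex and is not part of the R fix itself.

```python
import re

# Hypothetical sample of values to be joined
values = ['php', 'javascript', 'ruby', 'sql', 'c++']
pattern = r"[\w'+#-]+|[.!?;']"

fused = "".join(values)       # no delimiter: everything runs together
separated = ",".join(values)  # comma delimiter: entries stay apart

fused_tokens = re.findall(pattern, fused)
separated_tokens = re.findall(pattern, separated)

print(len(fused_tokens), len(separated_tokens))
```

Without a delimiter the regex sees one long token that matches none of the language names; with a comma, which is outside the token character class, each entry comes back as its own token.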

Example:

set.seed(1)
programing_language_list = list('python', 'java', 'c++', 'php', 'javascript', 'objective-c', 'ruby', 'perl','c','c#', 'sql','kotlin')
minimum_qualifications <- sample(programing_language_list, 10, replace = TRUE)

Your paste call now produces this:

miniumum_qualifications_string = sapply(paste(unlist(minimum_qualifications),sep=', ',collapse = ""),tolower)

  phpjavascriptrubysqlc++sqlkotlinperlperlpython 
"phpjavascriptrubysqlc++sqlkotlinperlperlpython" 

miniumum_qualifications_string = sapply(paste(unlist(minimum_qualifications), collapse = ","),tolower)

With only collapse = ",", the call above outputs a correctly delimited string:

 php,javascript,ruby,sql,c++,sql,kotlin,perl,perl,python 
"php,javascript,ruby,sql,c++,sql,kotlin,perl,perl,python" 

This can then be processed further with regmatches:

res_min = regmatches(miniumum_qualifications_string,gregexpr("[\\w'+#-]+|[.!?;']",miniumum_qualifications_string,perl = TRUE))

$`php,javascript,ruby,sql,c++,sql,kotlin,perl,perl,python`
 [1] "php"        "javascript" "ruby"       "sql"        "c++"        "sql"        "kotlin"     "perl"       "perl"       "python"    

Now, since regmatches returns a list, you need to unlist it before looping over it with for:

k=1   # R vectors are 1-indexed; starting at 0 would silently drop the first match
j <- vector("character", 0)
for( w in unlist(res_min))
{
  for(i in programing_language_list)
  {

    if(i == w) 
    {
      j[k]=i
      print(j[k])
      k=k+1
    }
  }
} 

[1] "php"
[1] "javascript"
[1] "ruby"
[1] "sql"
[1] "c++"
[1] "sql"
[1] "kotlin"
[1] "perl"
[1] "perl"
[1] "python"

> k
[1] 11

> j
 [1] "php"        "javascript" "ruby"       "sql"        "c++"        "sql"        "kotlin"     "perl"       "perl"       "python"    

Answer 1 (score: 0)

# most popular language list

programing_language_list = list('python', 'java', 'c++', 'php', 'javascript', 'objective-c', 'ruby', 'perl','c','c#', 'sql','kotlin')
#match(c('python',),programing_language_list)

# get our Minimum Qualifications column and convert all of the values to a list

minimum_qualifications = list(dataset[,6])

# let's join our list to a single string and lower case the letter

miniumum_qualifications_string = sapply(paste(unlist(minimum_qualifications),sep=', ',collapse = ""),tolower)

#install.packages("stringr")

library(stringr)

# find out which language occurs in most in minimum Qualifications string


res_min = regmatches(miniumum_qualifications_string,gregexpr("[\\w'+#-]+|[.!?;']",miniumum_qualifications_string,perl = TRUE))

# this is the frequency table of all tokens in res_min
res_min2=table(unlist(res_min))
res_min2=sort(res_min2, decreasing = TRUE)

# look up the count for each language; a language that never occurs gets 0
languages=unlist(programing_language_list)
counts=as.vector(res_min2)[match(languages, names(res_min2))]
counts[is.na(counts)]=0

# 12 x 2 data frame; the column name "language" is an assumption, since the
# original answer does not show how programming_language_table was created
programming_language_table=data.frame(language=languages, no_of_counts=counts)

programming_language_table=programming_language_table[order(-programming_language_table$no_of_counts),]

The output is:

python       97
javascript   77
java         76
sql          73
c++          54
c            17
c#           15
ruby         14
php           7
perl          6
objective-c   3
kotlin        3