Create a "virtual" corpus in python

Time: 2015-12-14 17:54:06

Tags: python r dataframe virtualization corpus

I need to create a corpus from a huge dataframe (or any Python equivalent of the R dataframe) by splitting it into as many dataframes as there are usernames.

For example I start from a dataframe like this:

username    search_term
name_1      "some_text_1"
name_1      "some_text_2"
name_2      "some_text_3"
name_2      "some_text_4"
name_3      "some_text_5"
name_3      "some_text_6"
name_3      "some_text_1"

[...]

name_n      "some_text_n-1"

And I want to obtain:

data frame 1
username    search_term
name_1      "some_text_1"
name_1      "some_text_2"

data frame 2
username    search_term
name_2      "some_text_3"
name_2      "some_text_4"

And so on..

I already asked this question for R, but now I realise that using Python's NLTK could be an advantage for me. I found out that in R I can create a virtual corpus. Is the same possible in Python? Or is there another way to solve this problem in Python?
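In Python with pandas, the split itself can be done with `groupby`. A minimal sketch (the column names `username` and `search_term` are taken from the example above; storing the pieces in a dict keyed by username is an assumption about how you want to access them):

```python
import pandas as pd

# A small DataFrame shaped like the example above
df = pd.DataFrame({
    "username": ["name_1", "name_1", "name_2", "name_2", "name_3"],
    "search_term": ["some_text_1", "some_text_2", "some_text_3",
                    "some_text_4", "some_text_5"],
})

# One sub-DataFrame per username, keyed by the username
frames = {name: group for name, group in df.groupby("username")}

print(frames["name_1"])
```

Each value in `frames` is itself a DataFrame, so the per-user text columns can then be fed into NLTK (or any other tokenizer) one user at a time.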

To see how I solved this problem in R see:

Split a huge dataframe in many smaller dataframes to create a corpus in r

How transform a list into a corpus in r?

1 answer:

Answer 0: (score: -1)

Here is your solution in R.

I created a similar data.frame, df:

df <- data.frame(group = rep(1:6, each = 2) , value = 1:12)

Here are the group index and the names for the future small data.frames:

idx <- unique(df$group)
nms <- paste0('df', idx)

Next, in a for loop, I create these small data.frames:

for (i in idx) {
  df_tmp <- df[df$group == i, ]        # rows belonging to group i
  do.call('<-', list(nms[i], df_tmp))  # assign df_tmp to the variable named "df<i>"
}
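The same loop translated to Python/pandas might look like the sketch below (an assumption on my part, since the answer only covers R). Instead of creating one variable per group, the idiomatic approach is to collect the small frames in a dict keyed by the names "df1" .. "df6":

```python
import pandas as pd

# Same data as the R example: groups 1..6, each repeated twice, values 1..12
df = pd.DataFrame({
    "group": [g for g in range(1, 7) for _ in range(2)],
    "value": range(1, 13),
})

# One small DataFrame per group, named df1 .. df6
small = {f"df{i}": sub for i, sub in df.groupby("group")}

print(small["df3"])
```

A dict avoids polluting the namespace with dynamically created variables, which is also why `assign`-style patterns are usually discouraged in both languages.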