Cumulative unique words in a huge dataframe

Date: 2019-04-08 13:55:35

Tags: python pandas dataframe nlp

How do I get the cumulative unique words from a dataframe column that has more than 500 words per row? The dataframe has ~300,000 rows.

I read the CSV file into a dataframe, with column A holding the text data. I tried creating two more columns (B and C) by looping through column A: for each row I take the unique words of column A as a set, store that set in Column B, and store its size in Column C.

For each subsequent row, I take the union of the current row's Column A words (as a set) with the previous row's Column B set.

This works for a small number of rows, but once the row count exceeds 10,000 performance degrades and the kernel eventually dies.

Is there a better way of doing this for a huge dataframe?

I also tried creating a separate dataframe with just the unique words and their count, but I still have the issue.

Sample code:

for index, row in DF.iterrows():
    if index == 0:
        result = set(row['Column A'].lower().split())
        DF.at[index, 'Column B'] = result
    else:
        result = set(row['Column A'].lower().split())
        # union with the cumulative set from the previous row
        DF.at[index, 'Column B'] = result.union(DF.loc[index - 1, 'Column B'])

DF['Column C'] = DF['Column B'].apply(len)

2 Answers:

Answer 0 (score: 0):

You can use CountVectorizer and then take a cumulative sum.

Read more about CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Cumulative sums in pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html
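
A minimal sketch of that idea, assuming the question's "Column A"/"Column C" names (note that CountVectorizer's default tokenizer differs slightly from str.split(), and densifying the matrix with toarray() may itself be memory-heavy at ~300k rows):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()                       # lowercases by default
doc_term = vectorizer.fit_transform(df["Column A"])  # sparse rows x vocabulary matrix
seen = np.cumsum(doc_term.toarray(), axis=0) > 0     # has word j appeared by row i?
df["Column C"] = seen.sum(axis=1)                    # cumulative unique-word count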

Answer 1 (score: 0):

Exploit the uniqueness of dictionary keys to accumulate the words.

I created a dictionary, cumulative_words, in which I store the unique words row by row by updating it with a dictionary whose keys are the unique words of the given row's sentence.

Code:

cumulative_words = {}

def cumulate(x):
    # dict keys are unique, so update() only adds words not seen before
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    return list(cumulative_words.keys())

df["Column B"] = df["Column A"].apply(cumulate)
df["Column C"] = df["Column B"].apply(len)

Update:

Since you said this code still runs into memory problems at around 200k rows, I'll try some very simple steps so we can understand more:

  1. Just update the cumulative dictionary

Build the dictionary of unique words before any dataframe operation:

cumulative_words = {}

for x in df["Column A"].values:
    # merge this row's unique words into the running dictionary
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
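
If this step alone succeeds, len(cumulative_words) already gives the total number of unique words in the column, which tells us the dictionary itself fits in memory.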

If this still doesn't solve it, I think we have to change the approach.

  2. Append the words to a list

This is the key point, in my opinion, because it builds lists holding on the order of billions of words in total:

cumulative_words = {}
cumulative_column = []

for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    # copy the keys into a list; appending the live keys() view would leave
    # every row referencing the same final dictionary
    cumulative_column.append(list(cumulative_words.keys()))

  3. Assign the created lists to Column B and count

df["Column B"] = cumulative_column
df["Column C"] = df["Column B"].apply(len)

Maybe there are just too many words to store and build a dataframe from, or maybe I'm missing something. Let me know.