Question

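I have a dataset of 1000 rows, each containing a given author and a large body of text belonging to that author. What I am ultimately trying to achieve is to break each text row into multiple rows containing an equal number of words. For example, given: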

Author    text

Jack      "This is a sentence that contains eight words"

John      "This is also a sentence containing eight words"

So if I wanted to work in chunks of 4 words, the result should be:

Author    text

Jack      "This is a sentence"

Jack      "that contains eight words"

John      "This is also a"

John      "sentence containing eight words"

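I have already been able to do this by number of characters using textwrapper, but ideally I would like to do it by number of words. Any help would be greatly appreciated, thank you!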

Answer 1

Assuming you are using pandas >= 0.25 (which supports df.explode), you can use the following approach:

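def split_by_equal_number_of_words(df, num_of_words, separator=" "):
    """
      1. Split each text entry to a list separated by 'separator'
      2. Explode to a row per word, remembering the original row of each word
      3. Number the words within each original row, group every
         'num_of_words' of them, and aggregate by joining with the 'separator' provided
    :param df: DataFrame with 'author' and 'text' columns (not modified)
    :param num_of_words: number of words per output row
    :param separator: string used to split and re-join the words
    :return: DataFrame with 'author' and 'text' columns, one row per chunk
    """
    # assign() works on a copy, so the caller's DataFrame is left untouched.
    words = df.assign(text=df["text"].str.split(separator)).explode("text")
    words = words.rename_axis("row").reset_index()
    # Count words per original row so a chunk never spans two texts,
    # even when a text's length is not a multiple of num_of_words.
    chunk = (words.groupby("row").cumcount() // num_of_words).rename("chunk")
    out = words.groupby(["row", chunk, "author"])["text"].agg(separator.join)
    return out.reset_index()[["author", "text"]]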
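
As a cross-check that does not depend on df.explode, the same chunking can be sketched in plain Python and then loaded back into a DataFrame. This is only an illustrative alternative, not part of the answer above: chunk_words is a hypothetical helper name, and the 'author' and 'text' column names are assumed to match the ones used in the function above.

```python
import pandas as pd


def chunk_words(text, num_of_words, separator=" "):
    """Hypothetical helper: split `text` into chunks of at most `num_of_words` words."""
    words = text.split(separator)
    return [
        separator.join(words[i:i + num_of_words])
        for i in range(0, len(words), num_of_words)
    ]


# Sample data taken from the question.
df = pd.DataFrame({
    "author": ["Jack", "John"],
    "text": [
        "This is a sentence that contains eight words",
        "This is also a sentence containing eight words",
    ],
})

# Build one (author, chunk) pair per chunk, then turn the pairs into rows.
rows = [
    (author, chunk)
    for author, text in zip(df["author"], df["text"])
    for chunk in chunk_words(text, 4)
]
result = pd.DataFrame(rows, columns=["author", "text"])
print(result)
```

Because each text is chunked on its own, the last chunk of a text is simply shorter when its word count is not a multiple of the chunk size. On 1000 rows a Python-level loop like this should be fast enough, though that has not been measured here.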

Is there a simple way to split a large string in a Pandas DataFrame into chunks with an equal number of words?
