Question

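I have a one-row DataFrame that contains a long document. I want to split the document into sentences (with sent_tokenize) and create one row per sentence, so that the number of observations grows from 1 (the document) to 10,000 (the sentences). For example, my DataFrame has a single row, like this: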

      document                      category
0     life is full of fake data..   wonderland

What I want is to split the document by sentence and create a row for every sentence:

      document                      category
0     life is full of fake data..   wonderland
1     but you have to sort out..    wonderland
2     what is fake what is not..    wonderland
      ..........
10000 you will get what you want.   wonderland

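Honestly, I have no idea how to approach this. I tokenized the sentences with sent_tokenize, but I don't know how to turn the resulting sentences into rows.

Thanks.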

Answer 1

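I'm sure there are more efficient ways, but this is flexible enough to produce the desired output. Basically, iterate over the DataFrame, split the text cell into sentences, and create a new row for each sentence while keeping its category: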

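import pandas as pd

test = """This is a sentence. This is another sentence. 
          This is a third sentence. We want a separate row for each sentence."""

df = pd.DataFrame({'docs': test, 'category': 'winterland'}, index=[0])

# Build a one-row frame per sentence, carrying the category along, then stack them.
# strip() drops the whitespace left by the split; ignore_index renumbers the rows 0..n-1.
df_new = pd.concat([pd.DataFrame({'doc': doc.strip(), 'category': row['category']}, index=[0])
           for _, row in df.iterrows()
           for doc in row['docs'].split('.') if doc.strip()],
           ignore_index=True)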

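df_new should now have the desired output. You could use sent_tokenize here in place of split('.'), or, for more advanced sentence boundary detection, spaCy's Doc.sents. spaCy has many great features and can be customized for NLP projects.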
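A more compact variant of the same idea is sketched below. It assumes pandas 0.25 or later (for `DataFrame.explode`), and both `split_sentences` and `explode_sentences` are hypothetical helper names introduced for illustration. `split_sentences` is a simple regex stand-in for a real tokenizer; passing `nltk.tokenize.sent_tokenize` as the `tokenizer` argument should fit the same way, since it also takes a string and returns a list of strings.

```python
import re

import pandas as pd


def split_sentences(text):
    """Hypothetical stand-in for sent_tokenize: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def explode_sentences(df, text_col, tokenizer=split_sentences):
    """Return a new frame with one row per sentence; the other columns are repeated."""
    out = df.copy()
    out[text_col] = out[text_col].apply(tokenizer)  # each cell becomes a list of sentences
    out = out.explode(text_col)                     # one row per list element
    return out.reset_index(drop=True)               # renumber the rows 0..n-1


df = pd.DataFrame({
    'document': ["This is a sentence. This is another sentence. "
                 "This is a third sentence. We want a separate row for each sentence."],
    'category': ['winterland'],
})

df_sentences = explode_sentences(df, 'document')
print(df_sentences)
```

Because `explode` repeats the remaining columns for every list element, the category is carried over without an explicit loop over rows.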

Answer 2

Another approach is to split on '.'.

Using the same test string as datawrestler:

import pandas as pd

test = """This is a sentence. This is another sentence. This is a third sentence. We want a separate row for each sentence."""

We can split the string into a list and then feed it to the DataFrame, like this:

df = pd.DataFrame({'docs': test.split('.'), 'category': 'winterland'})

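The only difference in the result is that you get a blank row at the bottom. You can filter that row out afterwards, or exclude it when creating the DataFrame by using a list comprehension, like this: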

pd.DataFrame({'docs': [sentence for sentence in test.split('.') if sentence !=''], 'category': 'winterland'})
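To see where the blank entry comes from, and why the pieces also keep a leading space, here is a small sketch using only the standard library. The regex alternative at the end is an illustration and not part of the original answer; it assumes sentences end with a period followed by whitespace.

```python
import re

test = ("This is a sentence. This is another sentence. "
        "This is a third sentence. We want a separate row for each sentence.")

# Plain split: the trailing '.' produces a final empty string,
# and every piece after the first keeps its leading space.
raw = test.split('.')

# Strip each piece and drop the ones that end up empty.
cleaned = [s.strip() for s in raw if s.strip()]

# Assumed alternative: split on whitespace that follows a period,
# which keeps the sentence-final punctuation attached.
with_periods = re.split(r'(?<=\.)\s+', test)

print(raw)
print(cleaned)
print(with_periods)
```

Either list can be passed as the 'docs' column when building the DataFrame. Note that both variants would still break on abbreviations or decimal numbers, which is where a real sentence tokenizer helps.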

Answer 3

You can use textblob:

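import pandas as pd
from textblob import TextBlob

text1 = '''That's right....the red velvet cake.....ohhh this stuff is so good.    
They never brought a salad we asked for.    
This hole in the wall has great Mexican street tacos, and friendly staff.'''

blob = TextBlob(text1)
# blob.sentences is a list of Sentence objects, not a DataFrame,
# so convert each one to a string to get one row per sentence.
df = pd.DataFrame({'document': [str(s) for s in blob.sentences]})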

How to split a document in a pandas DataFrame and create a row for each sentence
