Question about pandas string manipulation for sampling chat data by answer group

Time: 2019-01-21 10:08:39

Tags: python string pandas dataframe chatbot


Hello, I want to split a chat dataset into a training dataset and a test dataset. Is there a good way to do this with a pandas DataFrame?

Original dataframe

1  2                      3
A  Hi                     Hello, there
A  How are you            Hello, there
A  What's up              Hello,there
B  What is your name      My name is Thomas
B  May I know your name?  My name is Thomas
...

-> Training dataframe

1  2                      3
A  Hi                     Hello, there
A  How are you            Hello, there
B  What is your name      My name is Thomas
...

Test dataframe

1  2                      3
A  What's up              Hello,there
B  May I know your name?  My name is Thomas
...

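Basically, each answer [Col 3] has several questions [Col 2] mapped to it. Within each group of rows that share the same answer, I want to sample 10% to 20% of the question-answer pairs to separate the training and test data.

The complication is that an answer should only be handled this way when it has two or more questions.

Is there a good way to do this with a pandas DataFrame?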

1 Answer:

Answer 0 (score: 1)

This solution is a bit rough, but it works. As far as I know, there is no simple way to extract n random samples from each subgroup of a dataframe. What you can do is group the data by answer, concatenate the questions into a list, and then pick some elements from that list. To do that, your dataframe should look like this:

import pandas as pd

data = {
    'Question': [[['Hi Hello']], [['How are you']], [['Whats up']], [['What is your name']], [['May I know your name?']]],
    'Answer': ['there', 'there', 'there', 'My name is Thomas', 'My name is Thomas']
}

df = pd.DataFrame(data)

## df Output ##
#                     Question             Answer
# 0               [[Hi Hello]]              there
# 1            [[How are you]]              there
# 2               [[Whats up]]              there
# 3      [[What is your name]]  My name is Thomas
# 4  [[May I know your name?]]  My name is Thomas

Now group by answer:

# Summing lists concatenates them, so each answer ends up with a list of its questions
new_df = df.groupby('Answer').sum().reset_index()

## Output ##
#               Answer                                        Question
# 0  My name is Thomas  [[What is your name], [May I know your name?]]
# 1              there         [[Hi Hello], [How are you], [Whats up]]

Now iterate over each row and choose which questions go to training and which go to testing. Note that in this example the extraction is not actually random: I take the first n questions of each answer group for training and the last length(answer_group) - n for testing.

Complete working code:

# Append mode: rerunning the script adds the same rows again
with open('train.csv', 'a') as train_file, open('test.csv', 'a') as test_file:
    for _, instance in new_df.iterrows():
        n_questions = len(instance.Question)
        # The first two thirds of each group go to training (roughly a 2:1 train/test split)
        splits = int(2 * n_questions / 3)

        # Each question is stored as a one-element list, hence the [0]
        for train_example in instance.Question[:splits]:
            train_file.write(train_example[0] + ',' + instance.Answer + '\n')

        for test_example in instance.Question[splits:]:
            test_file.write(test_example[0] + ',' + instance.Answer + '\n')

## Files output ##

# train.csv #
# What is your name,My name is Thomas
# Hi Hello,there
# How are you,there

# test.csv #
# May I know your name?,My name is Thomas
# Whats up,there
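One caveat with joining the fields by hand with ',': if a question or answer contains a comma itself (as "Hello, there" does in the question), the resulting line no longer splits back into two columns. A minimal sketch of one way around this, using the standard csv module so such fields get quoted; the helper name write_pairs and the in-memory buffer are only illustrative assumptions, not part of the code above:

```python
import csv
import io


def write_pairs(handle, pairs):
    """Write (question, answer) pairs as CSV rows, quoting fields when needed."""
    writer = csv.writer(handle)
    for question, answer in pairs:
        writer.writerow([question, answer])


# An in-memory buffer stands in for train.csv / test.csv here.
buffer = io.StringIO()
write_pairs(buffer, [("Hi", "Hello, there"), ("What's up", "Hello, there")])

# Reading the text back yields exactly two fields per row.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

With real files, the csv documentation recommends opening them with newline='' before handing them to csv.writer.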

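Edit: I just noticed that I got the contents of the questions and answers wrong, but that is due to the poor formatting of the original post. Either way, the logic is exactly the same.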
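Since the split above always takes the first questions of each group, here is a rough sketch of a random variant that works directly on a flat dataframe. It is only one possible approach under some assumptions: the column names question and answer, the function name split_by_answer, the default test_frac of 0.2, and the extra "Bye" row (added to show a single-question group) are all made up for illustration, and the index is assumed to be unique.

```python
import pandas as pd


def split_by_answer(df, answer_col="answer", test_frac=0.2, seed=0):
    """Randomly hold out part of every answer group that has at least 2 questions."""
    test_parts = []
    for _, group in df.groupby(answer_col):
        if len(group) < 2:
            continue  # a single-question group stays entirely in training
        # At least one test row per group, but always leave one row for training.
        n_test = max(1, int(round(test_frac * len(group))))
        n_test = min(n_test, len(group) - 1)
        test_parts.append(group.sample(n=n_test, random_state=seed))
    test = pd.concat(test_parts) if test_parts else df.iloc[0:0]
    train = df.drop(test.index)  # relies on a unique index
    return train, test


# Hypothetical data modelled on the question; the last row is invented.
chat = pd.DataFrame({
    "question": ["Hi", "How are you", "What's up",
                 "What is your name", "May I know your name?", "Bye"],
    "answer": ["Hello, there", "Hello, there", "Hello, there",
               "My name is Thomas", "My name is Thomas", "See you"],
})

train, test = split_by_answer(chat)
print(train)
print(test)
```

Fixing seed keeps the split reproducible. With small groups, rounding means the realised test share can differ quite a bit from test_frac.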