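Question about pandas string manipulation for sampling chat data by answer group.
Hello, I want to split a chat dataset into a training dataset and a test dataset. Is there a good way to do this with a pandas DataFrame?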
Original DataFrame
1  2                      3
A  Hi                     Hello, there
A  How are you            Hello, there
A  What's up              Hello,there
B  What is your name,     My name is Thomas
B  May I know your name?  My name is Thomas
...
-> Training DataFrame
1  2                      3
A  Hi                     Hello, there
A  How are you            Hello, there
B  What is your name,     My name is Thomas
...
Test DataFrame
1  2                      3
A  What's up              Hello,there
B  May I know your name?  My name is Thomas
...
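Basically, each answer [Col 3] has several questions [Col 2] mapped to it. Within each group of questions that share the same answer, I want to sample 10% to 20% of the question/answer pairs to build the training and test data.
This is tricky, because the split is only possible for answers that have two or more questions.
Is there a good way to do this with a pandas DataFrame?
Answer 0: (score: 1)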
This solution is a bit rough, but it works. As far as I know, there is no simple way to extract n random samples from each subgroup of a DataFrame. What you can do is group the data by answer, concatenate the questions into a list, and then pick some random elements. For this, your DataFrame should look like this:
import pandas as pd
data = {
    'Question': [[['Hi Hello']], [['How are you']], [['Whats up']], [['What is your name']], [['May I know your name?']]],
    'Answer': ['there', 'there', 'there', 'My name is Thomas', 'My name is Thomas']
}
df = pd.DataFrame(data)
## df Output ##
                    Question             Answer
0               [[Hi Hello]]              there
1            [[How are you]]              there
2               [[Whats up]]              there
3      [[What is your name]]  My name is Thomas
4  [[May I know your name?]]  My name is Thomas
Now group by answer:
new_df = df.groupby('Answer').sum().reset_index()
## Output ##
              Answer                                        Question
0  My name is Thomas  [[What is your name], [May I know your name?]]
1              there         [[Hi Hello], [How are you], [Whats up]]
Now iterate over each row and choose which questions go to training and which go to testing. Note that in this example the extraction is not truly random: I pick the first n questions for training and the last length(answer_group) - n for testing.
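Since the selection described here is deterministic, a random per-group variant may be closer to what the question asks for. The following is only a minimal sketch, assuming one question per row instead of the nested lists used in this answer; the function name `split_by_answer` and the parameters `test_frac` and `seed` are hypothetical names introduced for illustration.

```python
import pandas as pd

# One question per row (an assumption; the answer stores nested lists instead)
df = pd.DataFrame({
    "Question": ["Hi", "How are you", "What's up",
                 "What is your name", "May I know your name?"],
    "Answer": ["Hello, there", "Hello, there", "Hello, there",
               "My name is Thomas", "My name is Thomas"],
})

def split_by_answer(frame, test_frac=0.2, seed=0):
    """Randomly move roughly test_frac of each answer group into the test set."""
    test_idx = []
    for _, group in frame.groupby("Answer"):
        if len(group) < 2:
            continue  # an answer with a single question stays in training
        # At least one test row per group, but always leave one for training
        n_test = min(len(group) - 1, max(1, round(test_frac * len(group))))
        test_idx.extend(group.sample(n=n_test, random_state=seed).index)
    test = frame.loc[test_idx]
    train = frame.drop(index=test_idx)
    return train, test

train_df, test_df = split_by_answer(df)
```

With groups this small, every answer contributes exactly one test question, so the effective test share is higher than 20%; on larger groups the share should approach `test_frac`.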
Full working code:
# Append mode ('a') keeps existing rows, so remove old files before re-running
with open('train.csv', 'a') as train_file, open('test.csv', 'a') as test_file:
    for _, instance in new_df.iterrows():
        n_questions = len(instance.Question)
        # Two thirds of each group go to training (a 2:1 train/test split)
        splits = int(2 * n_questions / 3)
        train = instance.Question[:splits]
        for train_example in train:
            train_file.write(train_example[0] + ',' + instance.Answer + '\n')
        test = instance.Question[splits:]
        for test_example in test:
            test_file.write(test_example[0] + ',' + instance.Answer + '\n')
## Files output ##
# train.csv #
What is your name,My name is Thomas
Hi Hello,there
How are you,there
# test.csv #
May I know your name?,My name is Thomas
Whats up,there
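One caveat with joining the fields by hand: a question or answer that itself contains a comma (such as "Hello, there" in the original post) would produce an ambiguous row. A minimal sketch of one way around this, using Python's standard `csv` module so that such fields get quoted; the helper name `write_pairs` and the in-memory buffer are assumptions made for illustration, not part of the answer's code.

```python
import csv
import io

def write_pairs(handle, pairs):
    """Write (question, answer) pairs as CSV rows, quoting fields when needed."""
    writer = csv.writer(handle)
    for question, answer in pairs:
        writer.writerow([question, answer])

# An in-memory buffer stands in for train.csv or test.csv
buffer = io.StringIO()
write_pairs(buffer, [
    ("Hi", "Hello, there"),
    ("What is your name,", "My name is Thomas"),
])
csv_text = buffer.getvalue()

# Reading the text back recovers the original two fields per row
rows = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file instead of a buffer, the `csv` documentation recommends opening it with `newline=''`.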
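Edit: I just noticed that I got the contents of the questions and answers wrong, but that is due to the bad formatting of the original post. Either way, the logic is exactly the same.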