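I have a CSV file in which each word of a sentence occupies one cell, with an empty cell between consecutive sentences.
My problem is with the run_id column. After loading the CSV file with pandas, I split it into sentences with the function get_sents_from_df. I then have an assert that checks that every sentence contains exactly one unique run_id value, but it fails because the "null" separator is treated as a "null sentence".
Below is my code snippet; I hope you can help.
Note: I am running with T = "test_RE".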
The function:
import pandas as pd

# `encoder` and `get_sents_from_df` are defined elsewhere (not shown here).
def load_dataset(fn, T):
    if T == "test_RE":
        df = pd.read_csv(fn,
                         sep=";",
                         header=0,
                         keep_default_na=False)
        # Drop the "Unnamed: N" columns that pandas creates for empty header cells.
        df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
        # Blank cells become <NA> under errors='coerce' with the nullable Int64 dtype.
        df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
        df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
        df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
        df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')
    else:
        df = pd.read_csv(fn,
                         sep="\t",
                         header=0,
                         keep_default_na=False)
    print(df.dtypes)
    if T == "train":
        encoder.fit(df.label.values)
        print('this is the IF cond')
        print('df.label.values. shape', df.label.values.shape)
    sents = get_sents_from_df(df)
    print('shape of sents 0', sents[0].shape)
    print('sents[0]', sents[0])
    print('shape of sents 1', sents[1].shape)
    print('sents[1]', sents[1])
    # make sure that all sents agree on run_id
    assert all(len(set(sent.run_id.values)) == 1
               for sent in sents)  # <-- the AssertionError is raised here
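The shape of sents[0] is (10, 8), which is correct, and the contents of sents[0] are correct too.
But sents[1] has shape (0, 8), and of course nothing is printed for sents[1] because it is empty. I expected sents[1] to have shape (6, 8). Any help?
[Image: output of the print statements]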
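To make the failure mode concrete, here is a minimal sketch of what the assertion sees when one entry of sents is an empty frame. The two small frames below are made-up stand-ins, not the real data: an empty frame has no run_id values, so the set built from it has length 0 rather than 1, and the all(...) check comes out false.

```python
import pandas as pd

# Hypothetical stand-ins for two entries of `sents`: a real sentence and
# an empty frame like the (0, 8) one described above.
full = pd.DataFrame({"word_id": [1, 2, 3], "run_id": [1, 1, 1]})
empty = full.iloc[0:0]  # zero rows, same columns

for name, sent in [("full", full), ("empty", empty)]:
    n_unique = len(set(sent.run_id.values))
    print(name, sent.shape, n_unique)

# The same condition as in the assert, evaluated without raising.
check = all(len(set(sent.run_id.values)) == 1 for sent in [full, empty])
print(check)
```

So the assert may be failing purely because of the empty frame, independently of whether the real sentences carry consistent run_id values.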
Answer 0 (score: 1)
To skip the blank rows (those containing only None values and empty strings), why not do this:
callback
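One possible way to skip such rows is sketched below; it is an illustration under stated assumptions, not necessarily the snippet the answer had in mind. The sample text and the column names in raw are made up. Because the question reads the file with keep_default_na=False, the separator row arrives as empty strings rather than NaN, so the mask tests for both.

```python
import io
import pandas as pd

# Made-up sample in the layout described in the question: one word per row,
# with an all-empty row between sentences.
raw = "word_id;word;run_id\n1;Hello;1\n2;world;1\n;;\n1;Bye;1\n"

df = pd.read_csv(io.StringIO(raw), sep=";", header=0, keep_default_na=False)

# A row counts as blank when every cell is either missing or an empty string.
blank = (df.isna() | (df == "")).all(axis=1)
cleaned = df[~blank].reset_index(drop=True)
print(cleaned)
```

If get_sents_from_df relies on the blank rows as sentence boundaries, dropping them before splitting would merge the sentences. In that case it may be safer to keep the rows and discard the empty frames after splitting, for example with sents = [s for s in sents if len(s) > 0].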