Question

I have a text file, for example:

{"SrcIP_Entropy": "0.8277165068887369", "DstIP_Entropy": "0.879014711680073", "SrcPort_Entropy": "0.8628103819624768", "DstPort_Entropy": "0.839472844582706"}

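I am trying to load this data into PySpark and predict against a trained model, but I cannot get it to work, possibly because of a data type mismatch or some other reason.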

This is what I used to train the model.

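from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# `assembler` (presumably a VectorAssembler producing 'features') and `data`
# are defined earlier and are not shown here
classifier = LinearSVC(labelCol='label', featuresCol='features')
pipeline = Pipeline(stages=[assembler, classifier])
train, test = data.randomSplit([0.8, 0.2])
model = pipeline.fit(train)

predictions = model.transform(test)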

The above works for training and test data from a CSV file, which I was able to load and test.

Now I want to run my model on multiple files that contain only features and no label. They are in validation_folder as files 1 through 20: entropy-1.txt, ..., entropy-20.txt.

import io
import json
import os

import numpy as np

# make a list of validation text files
validation_file_list = []
for root, dirs, files in os.walk("validation folder/"):
    validation_file_list = files


for i in validation_file_list:
    print(i)

    with io.open("validation folder/" + i, "r", encoding="utf-8") as my_file:
        my_unicode_string = my_file.read()
        d = json.loads(my_unicode_string)
        # fails here: model.transform() expects a Spark DataFrame, not a numpy array
        single_pred = model.transform(np.array(list(d.values())).reshape(1, -1))
        if single_pred == 0:
            print(i, ": error")

It gives me an error which, as I understand it, is caused by passing a numpy array; I need to create a PySpark DataFrame instead.
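
To illustrate the point about needing a DataFrame, here is a minimal sketch, not a confirmed fix. It assumes the assembler in the trained pipeline takes the four entropy columns as its inputs (the question does not show this), and the names `FEATURE_COLUMNS`, `read_feature_row`, and `predict_files` are hypothetical helpers introduced only for this example. The values in the file are quoted strings, so they are converted to floats before the DataFrame is built.

```python
import json

# Assumed feature order; it must match the inputCols of the assembler used in
# training, which the question does not show.
FEATURE_COLUMNS = [
    "SrcIP_Entropy",
    "DstIP_Entropy",
    "SrcPort_Entropy",
    "DstPort_Entropy",
]


def read_feature_row(path, columns=FEATURE_COLUMNS):
    """Read one JSON file and return its values as floats, in `columns` order."""
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)
    # The file stores numbers as strings ("0.8277..."), so convert explicitly.
    return tuple(float(record[name]) for name in columns)


def predict_files(spark, model, paths, columns=FEATURE_COLUMNS):
    """Build one DataFrame with a row per file and run the fitted pipeline on it."""
    rows = [(path,) + read_feature_row(path, columns) for path in paths]
    df = spark.createDataFrame(rows, schema=["file"] + list(columns))
    # transform() returns a DataFrame; the predicted class is in 'prediction'.
    return model.transform(df).select("file", "prediction")
```

With an active `spark` session and the fitted `model`, something like `predict_files(spark, model, paths).show()` would list one prediction per file. Because the result is a DataFrame, a check such as `single_pred == 0` has to be expressed as a filter on the `prediction` column rather than a direct comparison.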

I then tried loading a single file so that I could run the model on these test files, but I am still struggling to get it working.

This is the original code where I tried to load a single file; nothing gets loaded.

from pyspark.sql.types import DoubleType, StructField, StructType

df = spark.read.option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .schema(
        StructType([
            StructField('SrcIP_Entropy', DoubleType()),
            StructField('DstIP_Entropy', DoubleType()),
            StructField('SrcPort_Entropy', DoubleType()),
            StructField('DstPort_Entropy', DoubleType())
        ])
    ) \
    .csv("validation folder/entropy-1.txt")
df.show()


+-------------+-------------+---------------+---------------+
|SrcIP_Entropy|DstIP_Entropy|SrcPort_Entropy|DstPort_Entropy|
+-------------+-------------+---------------+---------------+
+-------------+-------------+---------------+---------------+
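
A likely reason for the empty table: the file contains JSON, not CSV, and with `header` set to `true` the CSV reader treats the file's only line as a header and drops it, so no data rows remain. The sketch below reads the files with Spark's JSON reader instead. It assumes every file holds a single JSON object on one line, as in the sample above; `cast_expressions` and `load_validation_folder` are hypothetical names used only for this example.

```python
# Column names taken from the sample file in the question.
FEATURE_COLUMNS = [
    "SrcIP_Entropy",
    "DstIP_Entropy",
    "SrcPort_Entropy",
    "DstPort_Entropy",
]


def cast_expressions(columns=FEATURE_COLUMNS):
    """Return Spark SQL expressions that cast each string column to DOUBLE."""
    return ["CAST(`{0}` AS DOUBLE) AS `{0}`".format(name) for name in columns]


def load_validation_folder(spark, folder="validation folder/"):
    """Read every JSON file in `folder` into one DataFrame of doubles."""
    # One JSON object per line is the layout spark.read.json expects by default.
    raw = spark.read.json(folder)
    # The values are quoted in the files, so they arrive as strings; cast them
    # and keep the source file name so each prediction can be traced back.
    return raw.selectExpr("input_file_name() AS file", *cast_expressions())
```

The resulting DataFrame could then be passed to `model.transform(...)`, again assuming the pipeline's assembler reads these four columns.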

Note: I am new to PySpark. Any suggestions from experts would be appreciated.

Load test data from a CSV file and predict against a trained model

0 answers: