I am trying to update the value of a column using the values of other columns in Scala.
This is the data in my DataFrame:
+-------------------+------+------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|   _c0|   _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+------+------+-----+------+----+--------------------+-----------+
|                  1|     0|     0| Name|     0|Desc|                    |          0|
|                  2|  2.11| 10000|Juice|     0| XYZ|2016/12/31 : Inco...|          0|
|                  3|-0.500|-24.12|Fruit|  -255| ABC| 1994-11-21 00:00:00|          0|
|                  4| 0.087|  1222|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5| 0.087|  1222|Bread|-22.06|    |                    |          0|
+-------------------+------+------+-----+------+----+--------------------+-----------+
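Here the _c5 column contains an invalid value (the value in row 2 contains the string "Incorrect"), and I want to update the isBadRecord field of that row to 1.
Is there a way to update this field?
Answer 0 (score: 2)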
You can use the withColumn API together with one of the functions that meets your needs to set isBadRecord to 1 for bad records.
For your case, you can write a udf function such as
import org.apache.spark.sql.functions.udf
def fillbad = udf((c5 : String) => if (c5 != null && c5.contains("Incorrect")) 1 else 0)
and call it from withColumn on the _c5 column to populate isBadRecord.
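To make the call site concrete, here is a minimal sketch of applying such a UDF through withColumn. The local session, the app name "fillbad-sketch", and the two-column stand-in DataFrame are assumptions for illustration and are not part of the original answer; in the Spark shell, `spark` already exists and the builder line can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("fillbad-sketch").getOrCreate()
import spark.implicits._

// Hypothetical cut-down version of the question's DataFrame.
val df = Seq(
  (2, "2016/12/31 : Incorrect"),
  (3, "1994-11-21 00:00:00")
).toDF("UniqueRowIdentifier", "_c5").withColumn("isBadRecord", lit(0))

// Null-safe variant of the UDF: a null _c5 is treated as not bad.
val fillbad = udf((c5: String) => if (c5 != null && c5.contains("Incorrect")) 1 else 0)

// withColumn replaces isBadRecord because a column with that name already exists.
val flagged = df.withColumn("isBadRecord", fillbad(col("_c5")))
flagged.show(false)
```

Because DataFrames are immutable, `flagged` is a new DataFrame; `df` itself is left unchanged.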
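Answer 1 (score: 2)
I suggest thinking about this the way you would in SQL, as selecting a new result, rather than reasoning about it as an in-place update. You can do the following: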
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.when

val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe
import spark.implicits._

// Select every column unchanged and recompute isBadRecord from _c5
df.select(
  $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
  $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Here is a self-contained script that you can copy and paste into the Spark shell to see the results locally:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types._

sc.setLogLevel("ERROR")

// Schema matching the DataFrame in the question
val schema =
  StructType(Seq(
    StructField("UniqueRowIdentifier", IntegerType),
    StructField("_c0", DoubleType),
    StructField("_c1", DoubleType),
    StructField("_c2", StringType),
    StructField("_c3", DoubleType),
    StructField("_c4", StringType),
    StructField("_c5", StringType),
    StructField("isBadRecord", IntegerType)))

// Sample rows; only row 2 carries the "Incorrect" marker in _c5
val contents =
  Seq(
    Row(1, 0.0, 0.0, "Name", 0.0, "Desc", "", 0),
    Row(2, 2.11, 10000.0, "Juice", 0.0, "XYZ", "2016/12/31 : Incorrect", 0),
    Row(3, -0.5, -24.12, "Fruit", -255.0, "ABC", "1994-11-21 00:00:00", 0),
    Row(4, 0.087, 1222.0, "Bread", -22.06, "", "2017-02-14 00:00:00", 0),
    Row(5, 0.087, 1222.0, "Bread", -22.06, "", "", 0)
  )

val df = spark.createDataFrame(sc.parallelize(contents), schema)
df.show()

// Same columns as df, with isBadRecord recomputed from _c5
val withBadRecords =
  df.select(
    $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
    $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
withBadRecords.show()
Here is the relevant output:
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|  _c0|    _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|                  1|  0.0|    0.0| Name|   0.0|Desc|                    |          0|
|                  2| 2.11|10000.0|Juice|   0.0| XYZ|2016/12/31 : Inco...|          0|
|                  3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00|          0|
|                  4|0.087| 1222.0|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5|0.087| 1222.0|Bread|-22.06|    |                    |          0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+

+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|UniqueRowIdentifier|  _c0|    _c1|  _c2|   _c3| _c4|                 _c5|isBadRecord|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
|                  1|  0.0|    0.0| Name|   0.0|Desc|                    |          0|
|                  2| 2.11|10000.0|Juice|   0.0| XYZ|2016/12/31 : Inco...|          1|
|                  3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00|          0|
|                  4|0.087| 1222.0|Bread|-22.06|    | 2017-02-14 00:00:00|          0|
|                  5|0.087| 1222.0|Bread|-22.06|    |                    |          0|
+-------------------+-----+-------+-----+------+----+--------------------+-----------+
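As a variation on the select shown in this answer, the same when/otherwise expression can be passed to withColumn, which avoids listing every column by hand. This is a sketch under assumptions: the local session, the app name "when-sketch", and the small stand-in DataFrame (including a null _c5, which the question's data does not show) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Assumption: a local session for illustration; the Spark shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("when-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in data: an empty string, a bad value, and a null.
val df = Seq(
  (1, Option("")),
  (2, Option("2016/12/31 : Incorrect")),
  (3, Option.empty[String])
).toDF("UniqueRowIdentifier", "_c5").withColumn("isBadRecord", lit(0))

// A null _c5 makes the condition null, so the row falls through to otherwise(0).
val withBadRecords =
  df.withColumn("isBadRecord", when(col("_c5").contains("Incorrect"), 1).otherwise(0))
withBadRecords.show(false)
```

Using built-in column functions rather than a UDF also keeps the expression visible to Spark's optimizer, which is usually a reasonable default when a built-in exists.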
Answer 2 (score: 1)
The best option is to create a UDF that tries to parse the value as a date. If the value can be parsed, return 0; otherwise return 1.
This works even when the date is badly formatted.
import java.text.SimpleDateFormat
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local")
  .appName("test").getOrCreate()
import spark.implicits._

// create test dataframe
val data = spark.sparkContext.parallelize(Seq(
  (1, "1994-11-21 Xyz"),
  (2, "1994-11-21 00:00:00"),
  (3, "1994-11-21 00:00:00")
)).toDF("id", "date")

// create udf which tries to convert to date format
// returns 0 on success (good record) and 1 on failure (bad record)
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(_) => 0
    case Failure(_) => 1
  }
})

// Add column
data.withColumn("badData", check($"date")).show
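One caveat worth illustrating: SimpleDateFormat.parse(String) may stop before the end of the input, so a value with a valid date followed by trailing text can still parse. The sketch below swaps in java.time, which requires the whole string to match. The name badDateFlag and the pattern are assumptions chosen for illustration, not part of the original answer.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper: 0 if the whole string is a valid timestamp, 1 otherwise.
// Null and empty inputs also fail to parse and are therefore flagged as 1.
def badDateFlag(value: String): Int = {
  val fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss")
  if (Try(LocalDateTime.parse(value, fmt)).isSuccess) 0 else 1
}

println(badDateFlag("1994-11-21 00:00:00"))
println(badDateFlag("1994-11-21 Xyz"))
println(badDateFlag("1994-11-21 00:00:00 xyz"))
```

The function can be wrapped with udf(badDateFlag _) and used like check above. Note that it flags empty values as bad, whereas the question's sample leaves rows with an empty _c5 at 0, so an explicit empty-string case may be needed depending on the intended rule.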
Hope this helps!