Spark: ignoring corrupted files

Time: 2018-11-29 14:45:57

Tags: apache-spark apache-spark-sql

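In the ETL process we manage, we sometimes receive corrupted files. We tried the following Spark setting and it seems to work (the Spark job no longer fails, because the corrupted files are skipped):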

spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")

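But I don't know whether there is any way to find out which files were ignored. Is there a way to get those file names?

Thanks in advance.

2 answers:

Answer 0: (score: 1)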

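One way is to look through the executor logs. If you set the following options to true in your Spark configuration:

RDD: spark.files.ignoreCorruptFiles
DataFrame: spark.sql.files.ignoreCorruptFiles

then Spark will log the corrupted files as WARN messages in the executor logs.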

The warning comes from the part of Spark that reads each file: when a read fails and the option is enabled, Spark logs a message along the lines of "Skipped the rest of the content in the corrupted file: <file>" and carries on with the next file. The exact wording may differ between Spark versions.
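To make the log-based approach concrete, here is a minimal Python sketch that scans executor log lines for that warning and collects the file names. The message wording and the "path: ..., range: ..." layout are assumptions based on Spark 2.x logs and should be checked against your own executor output; the function and pattern names are made up for illustration.

```python
import re

# Assumed wording of the warning. The DataFrame reader and the RDD reader
# phrase it slightly differently, so "of the " is optional here.
CORRUPT_FILE_WARNING = re.compile(
    r"Skipped the rest (?:of the )?content in the corrupted file: (?P<detail>.*)$"
)

# Assumed layout of the file description logged by the DataFrame reader.
PATH_FIELD = re.compile(r"^path: (?P<path>.+?), range: ")


def corrupted_files_from_log(lines):
    """Return the distinct files named in 'corrupted file' warnings, in order."""
    found = []
    for line in lines:
        match = CORRUPT_FILE_WARNING.search(line.rstrip("\n"))
        if not match:
            continue
        detail = match.group("detail")
        path_match = PATH_FIELD.match(detail)
        # Fall back to the raw text when the layout is not the one assumed above.
        entry = path_match.group("path") if path_match else detail
        if entry not in found:
            found.append(entry)
    return found
```

You would feed it the lines of each executor log, for example `corrupted_files_from_log(open(log_path))`, where `log_path` is wherever your cluster manager stores executor output.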

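Answer 1: (score: 0)

Did you solve this?

If not, maybe you could try the following:

  1. Read everything from the location with the ignoreCorruptFiles setting enabled.
  2. Use the built-in input_file_name function to get the name of the file each record came from, and take the distinct names.
  3. Separately, list all the objects in the corresponding directory.
  4. Take the difference between the two lists.

Did you end up using a different approach?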
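The four steps above could be sketched in PySpark roughly as follows. This is an illustration under assumptions, not a tested recipe: the helper names, the `fmt` default and the local-directory listing are hypothetical, and on HDFS or S3 you would list objects with that store's own client instead.

```python
import os
from urllib.parse import unquote, urlparse


def normalize(path):
    """Drop a URI scheme/host (file:, hdfs://host) so paths compare equal."""
    parsed = urlparse(path)
    return unquote(parsed.path) if parsed.scheme else path


def find_ignored_files(listed_paths, read_paths):
    """Step 4: files present in the listing that produced no records."""
    read = {normalize(p) for p in read_paths}
    return sorted({normalize(p) for p in listed_paths} - read)


def files_seen_by_spark(spark, input_dir, fmt="parquet"):
    """Steps 1 and 2: read with ignoreCorruptFiles, collect distinct file names."""
    from pyspark.sql.functions import input_file_name  # needs PySpark installed

    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    df = spark.read.format(fmt).load(input_dir)
    rows = df.select(input_file_name().alias("f")).distinct().collect()
    return [row["f"] for row in rows]


def list_local_files(input_dir):
    """Step 3 for a local directory; skips names starting with '_' or '.'."""
    result = []
    for root, _dirs, names in os.walk(input_dir):
        for name in names:
            if not name.startswith(("_", ".")):
                result.append(os.path.join(root, name))
    return result
```

With a SparkSession named `spark`, usage would look like `find_ignored_files(list_local_files(d), files_seen_by_spark(spark, d))`. Two caveats apply: a valid but empty file yields no records and will be reported as ignored, and a file that fails only partway through may already have contributed records, in which case it will not be reported.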