在我们管理的ETL流程中,我们有时会收到损坏的文件。 我们尝试了这种Spark配置,它似乎可以正常工作(Spark作业不会失败,因为损坏的文件被丢弃了):
spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
但是我不知道是否有知道哪些文件被忽略的信息。反正有那些文件名吗?
预先感谢
答案 0 :(得分:1)
一种方法是浏览执行程序日志。如果您在Spark配置中将以下配置设置为true。
RDD:spark.files.ignoreCorruptFiles 数据帧:spark.sql.files.ignoreCorruptFiles
然后spark将在执行程序日志中以WARN消息的形式记录损坏的文件。
这是Spark的代码片段,用于执行此操作:
Sub Three_Issues()
Dim ColumnLetter As String
Dim cell As Range
Dim sheetCount, TotalRow, TotalCol As Integer
'Dim item, uniqueArray As Variant
Dim item, uniqueArray() As Variant
Dim lastRow As Long
Application.ScreenUpdating = False
'Get unique brands:
With Sheets("Brand")
.Columns(1).EntireColumn.Delete
Sheets("Sales").Columns("R:R").AdvancedFilter Action:=xlFilterCopy, CopyToRange:=.Range("A1"), Unique:=True
lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
'uniqueArray = .Range("A3:A" & lastRow)
'Update:
If .Range("A3:A" & lastRow).Cells.Count = 1 Then
ReDim uniqueArray(1, 1)
uniqueArray(1, 1) = .Range("A3")
Else
uniqueArray = .Range("A3:A" & lastRow).Value
End With
TotalRow = Sheets("Sales").UsedRange.Rows.Count
TotalCol = Sheets("Sales").UsedRange.Columns.Count
ColumnLetter = Split(Cells(1, TotalCol).Address, "$")(1) 'Num2Char
sheetCount = 0 'Counter for statusbar
For Each item In uniqueArray 'item=Brand
'->Issue 1: Runtimer error 13 Types don't match: This happens if the uniqueArray consists of only one brand.
'Then item is Variant/Empty and uniqueArray is Variant/String
'If uniqueArray consists of more than one brand - which is usually the case - it works fine.
'item=Variant/Empty uniqueArray=e.g. Variant/Variant(1 to 2, 1 to 1)
'Can I change the Dim statement to solve this special case, or do I need arrays maybe?
'Filter sales for each brand:
With Sheets("Sales")
.Range(.Cells(2, 1), .Cells(TotalRow, TotalCol)).AutoFilter Field:=18, Criteria1:=item
End With
With Sheets("Agents")
'Delete old...
.Range(.Cells(2, 1), .Cells(2, 1).End(xlDown)).Clear
'...and get new
Sheets("Sales").Range(Sheets("Sales").Cells(3, 2), Sheets("Sales").Cells(2, 2).End(xlDown)).SpecialCells(xlCellTypeVisible).Copy
.Range("A2").PasteSpecial Paste:=xlPasteValues
Application.CutCopyMode = False
End With
'List with all agents
For Each cell In Worksheets("Agents").Range("A2", Worksheets("Agents").Range("A1").End(xlDown))
With Sheets("Report")
.Range("I4") = cell 'Copy agent and update the formulas within the report
'->Issue 2: It takes around 10 seconds to fill 10 sheets with the reports of 10 agents.
'When I reach 70-80 sheets, it slows down to 30 seconds for 10 sheets.
'Is this just because of the number of sheets, or can I speed it up again?
.Range(.PageSetup.PrintArea).Copy
Sheets.Add After:=Sheets("Report")
Selection.PasteSpecial Paste:=xlPasteAllUsingSourceTheme, Operation:=xlNone _
, SkipBlanks:=False, Transpose:=False
Selection.PasteSpecial Paste:=xlPasteColumnWidths, Operation:=xlNone, _
SkipBlanks:=False, Transpose:=False
ActiveSheet.UsedRange.Value = ActiveSheet.UsedRange.Value 'Replace all formulas with values
Application.CutCopyMode = False
ActiveSheet.Name = cell
sheetCount = sheetCount + 1
If sheetAnz Mod 10 = 0 Then Application.StatusBar = sheetAnz 'Get statusupdate every 10 sheets
End With
Next
'->Issue 3: I create up to 400 sheets and when I want to continue and do some sorting of the sheets for example it takes a very long time.
'But if I add this break for a second, it works reasonably fine again. Why is that? Does vba needs the break to catch up with itself?
'Since the issue is not the sorting and the other stuff after the pause.
Application.Wait (Now + TimeValue("0:00:01")) 'Code becomes faster after that...
'Continue with other stuff.... sorting sheets and so on
Next
Application.ScreenUpdating = True
End Sub
答案 1 :(得分:0)
您解决了吗?
如果没有,也许您可以尝试以下方法:
您使用其他方法吗?