spark rdd mapValues

Date: 2017-10-23 15:53:12

Tags: python apache-spark pyspark rdd

Question:


I just want to count the number of lines in each file, for example:

x = sc.wholeTextFiles("hdfs:///user/ydzhao/2017/05/01")
print(x.collect())
# [('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200100.txt', 'dkdksfk fjd 123 765\ndsfdf dfgd 53 23\n'),
#  ('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200200.txt', 'sdsddfdsk gfrged 454 7455')]

But when I use:

count = x.mapValues(lambda line: len(line)).collect()
print(count)

I get the length of each file's content in characters:

# [('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200100.txt', 37), ('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200200.txt', 25)]

Actually, what I need is the number of lines in each file:

[('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200100.txt', 2),
 ('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200200.txt', 1)]
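
A quick check in plain Python (no Spark needed) confirms where the 37 above comes from: len measures the characters of the whole content string, newline characters included.

s = 'dkdksfk fjd 123 765\ndsfdf dfgd 53 23\n'  # content of the first file
print(len(s))  # 37 -- characters, not lines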

1 answer:

Answer 0 (score: 0)

Use str.splitlines:

x.mapValues(lambda s: len(s.splitlines()))
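
Applied to the RDD above, this replaces the character count with a line count. str.splitlines splits on line boundaries and, unlike split('\n'), does not produce a trailing empty string when the content ends with a newline, so the first file counts as 2 lines rather than 3:

count = x.mapValues(lambda s: len(s.splitlines())).collect()
print(count)
# [('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200100.txt', 2),
#  ('hdfs://meihui/user/ydzhao/2017/05/01/TRAFF_20170501200200.txt', 1)]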