Transitioning from R to Python: where are my levels?

Time: 2018-02-03 22:52:19

Tags: python pandas

If I have a data frame like this:

import pandas as pd

df = pd.DataFrame({'labels': ['A', 'B', 'C'], 'moreLabels': ['D', 'E', 'F'],
                   'numbers': [1, 2, 3]})

I want to find all possible values of 'moreLabels'. Is there an easy way to do this? Right now I am pivoting and listing the columns of the pivot table, like this:

pivot = df.pivot_table(values='numbers', index='labels',
                       columns='moreLabels')
list(pivot.columns)

but that takes several steps, and I imagine there is a neater way to do it, something like R's levels().

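1 Answer:

Answer 0: (score: 4)

R's levels() function lists all possible values of a factor variable, even values that no longer appear in the data frame after subsetting. An ordinary pandas column does not behave this way.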

> library(data.table)
> df <- data.table(moreLabels = c('D', 'E', 'F'), numbers = c(1, 2, 3))
> df[, moreLabels := as.factor(moreLabels)]
> df[, levels(moreLabels)]
[1] "D" "E" "F"

> df[numbers > 1, ]  # if we subset, we only see values "E" and "F"
   moreLabels numbers
1:          E       2
2:          F       3

> df[numbers > 1, levels(moreLabels)]
[1] "D" "E" "F"  # even though we would expect only "E" and "F"

If you are looking for the unique values that actually appear in a column, use the pd.Series.unique() method.

>>> df['moreLabels'].unique()
array(['D', 'E', 'F'], dtype=object)
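
For something closer to R's levels(), one option is pandas' categorical dtype, which stores the set of categories on the column itself. The sketch below is an illustration under assumptions and not part of the original answer: it rebuilds the question's df, and the names subset, subset_cat, unique_in_subset, and levels_in_subset are made up for the example.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "labels": ["A", "B", "C"],
    "moreLabels": ["D", "E", "F"],
    "numbers": [1, 2, 3],
})

# unique() only reports values present in the rows it is called on.
subset = df[df["numbers"] > 1]
unique_in_subset = list(subset["moreLabels"].unique())

# With the categorical dtype, the categories travel with the column,
# so filtering rows does not shrink them (similar to an R factor).
df["moreLabels"] = df["moreLabels"].astype("category")
subset_cat = df[df["numbers"] > 1]
levels_in_subset = list(subset_cat["moreLabels"].cat.categories)

print(unique_in_subset)   # ['E', 'F']
print(levels_in_subset)   # ['D', 'E', 'F']
```

If only the values that are actually present matter, unique() is enough. The categorical route is worth considering when the full set of allowed labels should survive filtering.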