Pandas indexing with double brackets [[]]

Time: 2018-07-23 09:48:39

Tags: python python-3.x pandas tensorflow

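I've been working through Google's Machine Learning Crash Course, and one section has an exercise that teaches you how to use pandas and TensorFlow. At the start, they load the DataFrame and immediately afterwards grab the "total_rooms" and "median_house_value" series. They grab the "total_rooms" series with double brackets, but the "median_house_value" series with only a single set of brackets. I read through the pandas documentation, and it seems the only reason to index with double brackets is to select two columns at once, i.e. california_housing_dataframe[["median_house_value", "total_rooms"]]. Why do they use double brackets to index just one column of the DataFrame, while later using single brackets for what appears to be the same thing?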

Here is the code I'm talking about.

california_housing_dataframe = pd.read_csv("https://dl.google.com/mlcc/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

targets = california_housing_dataframe["median_house_value"]

If more context is needed, here is the surrounding code:

california_housing_dataframe = pd.read_csv("https://dl.google.com/mlcc/mledu-datasets/california_housing_train.csv", sep=",")

# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

targets = california_housing_dataframe["median_house_value"]

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Convert pandas data into a dict of np arrays.
    features = {key:np.array(value) for key,value in dict(features).items()}                                           

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
      ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels

prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)

If you need even more context, here is a link to all of the code: https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=firststeps-colab&hl=en

2 answers:

Answer 0: (score: 3)

Single brackets produce a pandas Series, whereas double brackets produce a pandas DataFrame.

Here is an example:

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
   col1  col2
0     1     3
1     2     4

Now let's print the type using both double and single brackets.

Single brackets give:

type(df["col1"])
pandas.core.series.Series

Double brackets give:

type(df[["col1"]])
pandas.core.frame.DataFrame

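So now you can see the difference: single-bracket and double-bracket indexing serve two different purposes. Single brackets with a column label return that column as a Series. Double brackets pass a list of column labels and return a new DataFrame, even when the list holds only one label. If you want to create a new DataFrame from existing columns of a DataFrame, use double brackets.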
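To make the distinction a little more concrete, here is a minimal sketch on the same kind of toy frame as above. The variable names (`series`, `frame`, `both`, `cols`) are only illustrative and do not come from the course notebook.

```python
import pandas as pd

# Toy frame with the same layout as the example above.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

series = df["col1"]          # single brackets: one column label
frame = df[["col1"]]         # double brackets: a list holding one label
both = df[["col1", "col2"]]  # the same list syntax with two labels

print(series.ndim, series.shape)  # one-dimensional
print(frame.ndim, frame.shape)    # two-dimensional, a single column
print(both.shape)

# The inner brackets are an ordinary Python list, so this is equivalent:
cols = ["col1"]
print(df[cols].equals(frame))
```

Seen this way, `[[...]]` is not special syntax: it is the usual `[]` indexing applied to a list of labels.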

Here is a similar answer with more explanation: The difference between double brace `[[...]]` and single brace `[..]` indexing in Pandas
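One plausible reason the exercise wants a DataFrame for the feature is the line `dict(features).items()` in the question's `my_input_fn`. The sketch below reuses that expression on a tiny made-up frame (not the housing data) to show what it yields for each type; `to_feature_dict` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the real column.
df = pd.DataFrame({"total_rooms": [10.0, 20.0, 30.0]})

def to_feature_dict(features):
    # Same expression as in my_input_fn from the question.
    return {key: np.array(value) for key, value in dict(features).items()}

from_frame = to_feature_dict(df[["total_rooms"]])  # DataFrame input
from_series = to_feature_dict(df["total_rooms"])   # Series input

print(list(from_frame.keys()))   # keys are column names
print(list(from_series.keys()))  # keys are row index labels
```

With a DataFrame the dictionary is keyed by column name, which matches the name passed to `tf.feature_column.numeric_column("total_rooms")`. With a Series the keys are the row labels, so the feature name is lost.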

Answer 1: (score: 1)

my_feature <class 'pandas.core.frame.DataFrame'>

targets <class 'pandas.core.series.Series'>

However, many functions work on both of these data structures. I could even pass either one to matplotlib functions.
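The same method call can still return different things depending on which structure it is called on. The following is a small sketch with made-up values (not the housing data); the names `as_series` and `as_frame` are illustrative.

```python
import pandas as pd

# Made-up values standing in for the real column.
df = pd.DataFrame({"total_rooms": [10.0, 20.0, 30.0]})

as_series = df["total_rooms"]
as_frame = df[["total_rooms"]]

series_max = as_series.max()  # a single scalar value
frame_max = as_frame.max()    # a Series with one entry per column

print(series_max)
print(type(frame_max).__name__)
print(frame_max["total_rooms"])

# describe() follows the same pattern: Series in, Series out;
# DataFrame in, DataFrame out.
print(type(as_series.describe()).__name__)
print(type(as_frame.describe()).__name__)
```

So code that only prints or plots the result may look identical for both, while code that expects a scalar or a column name can behave differently.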

While researching the difference, I found that it has already been explained here

import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

california_housing_dataframe = pd.read_csv("https://dl.google.com/mlcc/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
print(type(my_feature))
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

targets = california_housing_dataframe["median_house_value"]
print(type(targets))

print(my_feature.describe())
print(targets.describe())

print(my_feature.head())
print(targets.head())

print(my_feature.max())
print(targets.max())