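I've been working through Google's Machine Learning Crash Course, and one section has an exercise that teaches you how to use pandas and TensorFlow. At the start, they load the DataFrame and then immediately select the "total_rooms" and "median_house_value" columns. They select "total_rooms" with double brackets but "median_house_value" with only a single pair of brackets. I read through the pandas documentation, and it seems the only reason to index a DataFrame with double brackets is to select two columns at once, i.e. california_housing_dataframe[["median_house_value", "total_rooms"]]. Why do they use double brackets to select just one column from the DataFrame, and then use single brackets later for what looks like the same operation?
Here is the code I'm talking about: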
california_housing_dataframe = pd.read_csv("https://dl.google.com/mlcc/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
targets = california_housing_dataframe["median_house_value"]
If more context is needed, here is the surrounding code:
california_housing_dataframe = pd.read_csv("https://dl.google.com/mlcc/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
targets = california_housing_dataframe["median_house_value"]
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Trains a linear regression model of one feature.
    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}
    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)
    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
If you need even more context, here is a link to the full code: https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=firststeps-colab&hl=en
Answer 0 (score: 3)
Single brackets produce a pandas Series, whereas double brackets produce a pandas DataFrame.
Here is an example:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
   col1  col2
0     1     3
1     2     4
Now let's print the type using both single and double brackets.
Single brackets produce:
type(df["col1"])
pandas.core.series.Series
Double brackets produce:
type(df[["col1"]])
pandas.core.frame.DataFrame
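So now you can see the difference: single-bracket and double-bracket indexing serve two different purposes. If you want to create a new DataFrame from existing columns of a DataFrame, use double brackets.
Here is a similar answer with more explanation: The difference between double brace `[[...]]` and single brace `[..]` indexing in Pandas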
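To make the distinction concrete, here is a minimal sketch that reuses the toy frame from this answer. The variable names (`series`, `frame`, `two_cols`) are only illustrative. It shows that the double-bracket form is ordinary list-based column selection, which is why it returns a two-dimensional result even when the list holds a single name.

```python
import pandas as pd

# Toy frame, same data as the example in this answer.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

series = df["col1"]    # single brackets: one-dimensional Series
frame = df[["col1"]]   # double brackets: DataFrame with a single column

print(type(series).__name__, series.shape)
print(type(frame).__name__, frame.shape)

# The inner brackets are just a Python list of column names,
# so the same syntax selects (and orders) several columns.
two_cols = df[["col2", "col1"]]
print(list(two_cols.columns))

# The two forms can be converted into each other.
assert series.to_frame().equals(frame)
assert frame["col1"].equals(series)
```

The Series has shape `(2,)` while the one-column DataFrame has shape `(2, 1)`, which matters to any code that expects column labels or a two-dimensional input.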
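Answer 1 (score: 1)
my_feature is a <class 'pandas.core.frame.DataFrame'>
targets is a <class 'pandas.core.series.Series'>
However, many functions work on both data structures. I could even pass both to matplotlib functions.
While researching the difference, I found that it has already been explained here.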
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
california_housing_dataframe = pd.read_csv("https://dl.google.com/mlcc/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]
print(type(my_feature))
# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
targets = california_housing_dataframe["median_house_value"]
print(type(targets))
print(my_feature.describe())
print(targets.describe())
print(my_feature.head())
print(targets.head())
print(my_feature.max())
print(targets.max())
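One likely reason the exercise selects the feature with double brackets can be seen in my_input_fn from the question, which calls dict(features).items() to build a dict keyed by feature name. This is my reading of the code, not something the course states. The sketch below isolates that one line in a hypothetical helper, `to_feature_dict`, and uses made-up values instead of the housing data.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column name matches the exercise.
df = pd.DataFrame({"total_rooms": [10.0, 20.0, 30.0]})


def to_feature_dict(features):
    """Hypothetical helper: the same conversion my_input_fn applies."""
    return {key: np.array(value) for key, value in dict(features).items()}


# DataFrame input: keys are column names, values are whole columns.
from_frame = to_feature_dict(df[["total_rooms"]])

# Series input: keys are the row labels, values are single numbers.
from_series = to_feature_dict(df["total_rooms"])

print(list(from_frame.keys()))
print(list(from_series.keys()))
```

With the DataFrame, the dict has the single key "total_rooms", which matches the name given to tf.feature_column.numeric_column. With the Series, the keys are the row labels 0, 1, 2, so no feature named "total_rooms" would exist. The targets carry no feature name, so a plain Series is enough there.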