I am using pandas to import a dataset as a DataFrame.
import pandas as pd
import numpy as np
# import file as dataframe from Working directory
df1 = pd.read_excel('20180905_NAICS_to_GCD_industry.xlsx', sheet_name = 0)
# rename columns
df1 = df1.rename(columns = {'NAICS 2012' : 'NAICS', 'GCD Industry code':'GCD_Code', 'Mapped GCD Industry':'GCD'})
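I am trying to find out which rows of the DataFrame contain each distinct value (factor) of the GCD column.
For example, for one value: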
np.where(df1['GCD'].eq('Private Sector Services (Household)'))
Out[32]:
(array([1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253, 1254, 1257, 1258,
1259, 1260, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269,
1272, 1273, 1274, 1275, 1276, 1277, 1279, 1280, 1281, 1282, 1283,
1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, 1294,
1295, 1296, 1297, 1298, 1299], dtype=int64),)
This is what I expect: a single array of row positions. But when I do this instead:
np.where(df1.eq('Public Administration and Defence'))
Out[30]:
(array([ 942, 1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309,
1310, 1311, 1312, 1313, 1314, 1315, 1316, 1317, 1318, 1319, 1320,
1321, 1322, 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331,
1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342,
1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353],
dtype=int64),
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64))
I get two arrays instead of one, which causes a problem downstream.
Can someone explain what the root cause is, and how I can fix it?
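A minimal sketch of the likely cause, using a toy frame with made-up values in place of df1 (the name toy and its contents are assumptions for illustration only): the first call compares a single column, so np.where sees a 1-D mask, while the second compares the whole DataFrame, so np.where sees a 2-D mask and returns one array of row positions plus one array of column positions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1; the values are invented for illustration.
toy = pd.DataFrame({
    'NAICS': [111, 112, 921, 922],
    'GCD_Code': [1, 1, 2, 2],
    'GCD': ['Agriculture', 'Agriculture',
            'Public Administration and Defence',
            'Public Administration and Defence'],
})

target = 'Public Administration and Defence'

# Comparing one column gives a 1-D boolean mask,
# so np.where returns a tuple holding a single array of row positions.
series_hits = np.where(toy['GCD'].eq(target))

# Comparing the whole frame gives a 2-D boolean mask,
# so np.where returns (row positions, column positions).
frame_hits = np.where(toy.eq(target))

# Selecting the column first yields just the row positions.
rows = series_hits[0]
```

In the output above, the second array is all 2s because every match sits in the third column (GCD, position 2). Selecting the column before comparing, as in the first call, should give the single array of rows.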
Here is some information about my DataFrame:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1355 entries, 0 to 1354
Data columns (total 3 columns):
NAICS 1355 non-null int64
GCD_Code 1355 non-null int64
GCD 1355 non-null object
dtypes: int64(2), object(1)
memory usage: 31.8+ KB
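Since the goal is the row positions for every value of GCD, one possible approach is to group on that column rather than call np.where once per value. The sketch below assumes a toy frame with made-up values (small and rows_by_gcd are illustrative names, not part of the original code) and uses the indices attribute of a pandas groupby, which maps each group key to an array of positional row indices.

```python
import pandas as pd

# Toy stand-in for df1; the values are invented for illustration.
small = pd.DataFrame({
    'NAICS': [111, 112, 921, 922, 113],
    'GCD': ['Agriculture', 'Agriculture',
            'Public Administration and Defence',
            'Public Administration and Defence',
            'Agriculture'],
})

# Dict mapping each distinct GCD value to the positions of its rows.
rows_by_gcd = small.groupby('GCD').indices

for gcd_value, positions in rows_by_gcd.items():
    print(gcd_value, positions.tolist())
```

With the default RangeIndex shown in df1.info(), these positions coincide with the index labels.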