
Question

I am reading data from a CSV file to perform feature elimination. Here is what the data looks like:

shift_id    user_id status  organization_id location_id department_id   open_positions  city    zip role_id specialty_id    latitude    longitude   years_of_experience
0   2   9   S   1   1   19  1   brooklyn    48001.0 2.0 9.0 42.643  -82.583 NaN
1   6   60  S   12  19  20  1   test    68410.0 3.0 7.0 40.608  -95.856 NaN
2   9   61  S   12  19  20  1   new york    48001.0 1.0 7.0 42.643  -82.583 NaN
3   10  60  S   12  19  20  1   test    68410.0 3.0 7.0 40.608  -95.856 NaN
4   21  3   S   1   1   19  1   pune    48001.0 1.0 2.0 46.753  -89.584 0.0

Here is my code:

dataset = pd.read_csv("data.csv",header = 0)
data = pd.read_csv("data.csv",header = 1)
target = dataset.location_id
#dataset.head()
svm = LinearSVC()
rfe = RFE(svm, 3)
rfe = rfe.fit(data, target)
print(rfe.support_)
print(rfe.ranking_)

But I get this error:

ValueError: could not convert string to float: '1,141'

There is no such string in my database.

There are some empty cells, so I tried:

result.fillna(0, inplace=True)

which gave this error:

ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Any suggestions on how to handle this data properly?

Here is a link to the sample data: https://gist.github.com/karimkhanp/6db4f9f9741a16e46fc294b8e2703dc7

Answer 1

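Your question contains result.fillna(0, inplace=True).

However, since result never appears anywhere earlier, it is unclear what its value is (possibly a scalar).

There is another odd detail in your code. Look at: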

dataset = pd.read_csv("prod_data_for_ML.csv",header = 0)
data = pd.read_csv("prod_data_for_ML.csv",header = 1)

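Note that you read the same file twice, but:

the first time you read it with header = 0, so, as the documentation states, column names are inferred from the first line,
the second time you read it with header = 1.

Is this your intention? Or should header be the same in both calls?

One more point: in my opinion, reading the same file twice is unnecessary. Perhaps your code should look like this: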

data = pd.read_csv("prod_data_for_ML.csv",header = 0)
target = data.location_id

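Edit

From your comment I gather that you want:

the first table - dataset - with the first column (shift_id),
the second table - data - without this column.

Then your code should contain: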

dataset = pd.read_csv("data.csv",header = 0)  # Read the whole source file, reading column names from the starting row
data = dataset.drop(columns='shift_id')       # Copy dropping "shift_id" column
...

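Note that header=1 does not "skip" any column; it only states which source row the column names are read from. In this case:

row 0 (the starting row, containing the actual column names) is skipped,
column names are read from the next row (because of header=1), which actually contains the first row of data,
only the remaining rows are read in as rows of the target table.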
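
The effect described above can be checked on a small in-memory file. This is only a sketch: csv_text is a made-up stand-in for data.csv, not the asker's real file.

```python
import io
import pandas as pd

# Made-up three-column stand-in for data.csv (an assumption for illustration)
csv_text = "shift_id,user_id,status\n2,9,S\n6,60,S\n9,61,S\n"

# header=0: column names come from the first line
first = pd.read_csv(io.StringIO(csv_text), header=0)

# header=1: the first line is skipped and the first data row becomes the header
second = pd.read_csv(io.StringIO(csv_text), header=1)

print(list(first.columns))      # ['shift_id', 'user_id', 'status']
print(list(second.columns))     # ['2', '9', 'S']
print(len(first), len(second))  # 3 2
```

With header=1 no column is dropped; instead one row of data is lost and the column names are wrong.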

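If you want to "skip" some source columns, call read_csv with the usecols parameter, but note that it specifies which columns to read (not which to skip).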

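So, assuming your source file has 14 columns (numbered 0 to 13) and you want to omit only the first one (number 0), you can write usecols=[*range(1, 14)] (note that the upper bound, 14, is not included in the range).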
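
A minimal sketch of usecols on a small in-memory file follows. The four-column csv_text and the name n_cols are assumptions made for illustration; for the 14-column file, n_cols would be 14.

```python
import io
import pandas as pd

# Made-up four-column stand-in for the source file
csv_text = "shift_id,user_id,status,location_id\n2,9,S,1\n6,60,S,19\n"
n_cols = 4  # number of columns in this sample

# Read every column except column 0 by listing the positions to keep
data = pd.read_csv(io.StringIO(csv_text), header=0, usecols=[*range(1, n_cols)])
print(list(data.columns))  # ['user_id', 'status', 'location_id']

# usecols also accepts a callable that is evaluated on each column name
data2 = pd.read_csv(io.StringIO(csv_text), header=0,
                    usecols=lambda name: name != "shift_id")
print(list(data2.columns))  # ['user_id', 'status', 'location_id']
```

The callable form avoids hard-coding the column count, at the cost of naming the column to leave out.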

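Another remark about your data sample: the first column is the index, which has no name. shift_id is the next column, so, to avoid confusion, you should add some indentation to the first line.

Note that the city column is at position 8 in the header, but in the data rows (brooklyn, test) it is at position 9. So the "header" row (column names) should be indented.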

Edit 2

Look at your comment on the question, written at 2019-02-14 12:40:19Z. It contains a row like this:

"1,141","1,139",A,14,24,77,1,OWINGS MILLS,"21117"

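It shows that the first 2 columns (shift_id and user_id) contain string representations of floating-point numbers, but with a comma instead of a dot.

You can work around this with your own converter function, e.g.: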

def cnvToFloat(x):
    return float(x.replace(',', '.'))

and call read_csv, passing this function in the converters parameter for such "required" (ill-formatted) columns, e.g.:

dataset = pd.read_csv("data.csv", header = 0, 
    converters={'shift_id': cnvToFloat, 'user_id': cnvToFloat})
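
To see what the converter does, here is a sketch that runs the same call on an in-memory file. The rows in csv_text are made up, apart from the "1,141" and "1,139" values quoted from the comment. It reads the comma as a decimal mark, as this answer assumes.

```python
import io
import pandas as pd

def cnvToFloat(x):
    # Treat the comma as a decimal mark: '1,141' -> 1.141
    return float(x.replace(',', '.'))

# Made-up sample; quoted fields keep the embedded commas in one cell
csv_text = 'shift_id,user_id,status\n"1,141","1,139",A\n"2,5","3,25",S\n'

dataset = pd.read_csv(io.StringIO(csv_text), header=0,
    converters={'shift_id': cnvToFloat, 'user_id': cnvToFloat})

print(dataset.shift_id.tolist())  # [1.141, 2.5]
print(dataset.user_id.tolist())   # [1.139, 3.25]
```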

Answer 2

1,141 is not a valid float.

To convert it to a float, first turn it into a valid literal by replacing , with ., then cast it to float.

bad_float = '1,141'

print(float(bad_float.replace(",",".")))

Output:

1.141

Edit:

As @ShadowRanger pointed out, that holds unless the commas are actually separating groups of digits (to make the number easier to read):

comm_sep = '1,141'

res = comm_sep.split(",")

print(float(res[0]), float(res[1]))

Output:

1.0 141.0
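
If the comma is a thousands separator and the value is meant to be a single number, one alternative (not part of the original answer) is to remove the comma instead of splitting on it. A minimal sketch:

```python
comm_sep = '1,141'

# Drop the grouping comma so the digits form one number
as_number = float(comm_sep.replace(",", ""))

print(as_number)  # 1141.0
```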

Edit 2:

The OP solved the problem by changing the column type to number in the CSV file editor.

Answer 3

The solution to your ValueError: could not convert string to float: '1,141' is to use the thousands parameter of pd.read_csv():

dataset = pd.read_csv("data.csv",header = 0, thousands= r",")
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 14 columns):
shift_id                3 non-null int64
user_id                 3 non-null int64
status                  3 non-null object
organization_id         3 non-null int64
location_id             3 non-null int64
department_id           3 non-null int64
open_positions          3 non-null int64
city                    3 non-null object
zip                     3 non-null int64
role_id                 3 non-null int64
specialty_id            2 non-null float64
latitude                3 non-null float64
longitude               3 non-null float64
years_of_experience     3 non-null object
dtypes: float64(3), int64(8), object(3)
memory usage: 416.0+ bytes
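
The same call can be tried without the original file. In this sketch csv_text is a made-up sample modelled on the row quoted in the comment, so the values are assumptions, not the asker's data.

```python
import io
import pandas as pd

# Made-up sample with comma-grouped numbers inside quoted fields
csv_text = (
    'shift_id,user_id,status,zip\n'
    '"1,141","1,139",A,"21117"\n'
    '"1,142","1,140",S,"21117"\n'
)

# thousands="," tells the parser to treat the comma as a digit-group separator
dataset = pd.read_csv(io.StringIO(csv_text), header=0, thousands=",")

print(dataset.shift_id.tolist())  # [1141, 1142]
print(dataset.user_id.tolist())   # [1139, 1140]
```

Once the numeric columns parse as numbers, the remaining text columns (status, city) would still need to be dropped or encoded before fitting an estimator.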
