Want to group categorical values in a column

Time: 2018-06-18 17:35:14

Tags: python-3.x pandas group-by scikit-learn one-hot-encoding

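I am trying to group the values in a column 'neighborhood' and assign each group a numeric value. The column holds values such as: #Queens#Jackson Heights#, #Manhattan#Upper East Side#Sutton Place#, #Brooklyn#Williamsburg#, #Bronx#East Bronx#Throgs Neck#. (Values have 2 or 3 tags, sometimes 4 or 5.) I used a plain if/else loop, and it worked for the first 3 values, as shown in the attached image, but I am not sure it is working correctly. Please help me group the values and assign a number to each group. The if/else loop I used is as follows:

# Create a list to store the data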
grades = []
# For each row in the column,
for row in new_train1['neighborhood']:
    # if more than a value,
    if row > '#Queens#':
        # Append a num grade
        grades.append('1')
    # else, if more than a value,
    elif row > '#Manhattan#':
        # Append a letter grade
        grades.append('2')
    # else, if more than a value,
    elif row > '#Bronx#':
        # Append a letter grade
        grades.append('3')
    # else, if more than a value,
    elif row > '#Brooklyn#':
        # Append a letter grade
        grades.append('4')
    # else, if more than a value,
    else:
        # Append a failing grade
        grades.append('0')

[1]: https://i.stack.imgur.com/iQ3E8.png
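A note on the comparisons in the loop above: in Python, `>` between two strings compares them character by character (lexicographic order); it does not test whether the value starts with or contains the label. That would explain why the loop seems to work for some rows only. The sketch below mirrors the if/elif chain and sets it next to a lookup on the first label; the names `grade_by_comparison`, `grade_by_first_label` and `borough_codes` are made up for this illustration and are not from the original post.

```python
# Codes taken from the if/elif chain in the question
borough_codes = {"Queens": "1", "Manhattan": "2", "Bronx": "3", "Brooklyn": "4"}


def grade_by_comparison(value):
    """Mirror of the question's chain: lexicographic string comparison."""
    if value > "#Queens#":
        return "1"
    elif value > "#Manhattan#":
        return "2"
    elif value > "#Bronx#":
        return "3"
    elif value > "#Brooklyn#":
        return "4"
    return "0"


def grade_by_first_label(value):
    """Split on '#', drop empty pieces, and look up the first label."""
    parts = [p for p in value.split("#") if p]
    if not parts:
        return "0"
    return borough_codes.get(parts[0], "0")


samples = [
    "#Queens#Jackson Heights#",
    "#Manhattan#Upper East Side#Sutton Place#",
    "#Brooklyn#Williamsburg#",
    "#Bronx#East Bronx#Throgs Neck#",
]
for s in samples:
    print(s, grade_by_comparison(s), grade_by_first_label(s))
```

With these samples, the comparison chain gives the Brooklyn row the Bronx code, because "#Brooklyn..." sorts after "#Bronx#", and the '4' branch is never reached.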

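2 Answers:

Answer 0 (score: 0)

Please avoid pasting images and testing our typing skills. If I understood your question correctly, I would do something like this: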

import pandas as pd

# creating data frame
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["#Queens#Jackson Heights#",
          "Manhattan#Upper East Side#Sutton Place#",
          "Bronx#West East Side#",
          "Manhattan#Upper East Side#",
          "#Manhattan#Downtown#Chelsea"],
})
# creating replacement dictionary (every label in column B needs an entry)
replace_dic = {"Queens": 1, "Jackson Heights": 2, "Manhattan": 3, "Upper East Side": 4, "Sutton Place": 5,
               "Bronx": 6, "West East Side": 7, "Downtown": 8, "Chelsea": 9}
# replacing: split on '#', skip the empty strings, look up each label
df['C'] = df['B'].str.split("#").apply(lambda x: [replace_dic[i] for i in x if i != ''])
# result
   A                                        B          C
0  1                 #Queens#Jackson Heights#     [1, 2]
1  2  Manhattan#Upper East Side#Sutton Place#  [3, 4, 5]
2  3                    Bronx#West East Side#     [6, 7]
3  4               Manhattan#Upper East Side#     [3, 4]
4  5              #Manhattan#Downtown#Chelsea  [3, 8, 9]
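If the column has many distinct labels, writing the dictionary by hand gets tedious, and any label missing from it raises a KeyError. As a rough sketch (not part of the original answer), the dictionary could be built from the data itself. The name `auto_dic` and the numbering by order of first appearance are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"B": ["#Queens#Jackson Heights#",
                         "Manhattan#Upper East Side#Sutton Place#",
                         "#Manhattan#Downtown#Chelsea"]})

# Split on '#' and drop the empty strings produced by leading/trailing '#'
labels = df["B"].str.split("#").apply(lambda parts: [p for p in parts if p])

# Number each distinct label in order of first appearance, starting at 1
unique_labels = pd.unique(labels.explode())
auto_dic = {label: code for code, label in enumerate(unique_labels, start=1)}

# Replace each label with its generated code
df["C"] = labels.apply(lambda parts: [auto_dic[p] for p in parts])
print(df)
```

The generated codes are only stable for a given dataset, so the same `auto_dic` would have to be reused when encoding other data.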

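Based on your comment, I think you are looking for something like this:

def replacefunc(x):
    # drop the empty strings left by leading/trailing '#'
    x = [i for i in x if i != '']
    # return the code of the first label only
    return replace_dic[x[0]]
df['D'] = df['B'].str.split("#").apply(replacefunc)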

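Answer 1 (score: 0)

Thanks everyone for your help and input. I removed the '#' marks with a simple split, and then used a for loop to look only at the first word in each row. It gives me the expected output, but it also raises an index out of range error, which I am still working on. The code is below: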

train = pd.DataFrame(train, columns = ['id','listing_type','floor','latitude','longitude','price','beds','baths','total_rooms','square_feet','pet_details','neighborhood'])
# Create a list to store the data
grades = []

# For each row in the column, split on '#';
# row[1] is the first label when the value starts with '#'
for row in train['neighborhood'].str.split('#'):
    # if the first label is Queens,
    if row[1] == 'Queens':
        # append its numeric code
        grades.append('1')
    # else, if the first label is Manhattan,
    elif row[1] == 'Manhattan':
        grades.append('2')
    # else, if the first label is Bronx,
    elif row[1] == 'Bronx':
        grades.append('3')
    # else, if the first label is Brooklyn,
    elif row[1] == 'Brooklyn':
        grades.append('4')
    # else, no known borough matched,
    else:
        # append the default code
        grades.append('0')
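One likely cause of the index out of range error is a row whose split result has fewer than two elements, for example an empty string or a value with no '#' at all. A possible way around it, sketched below under that assumption, is to strip the '#' marks from both ends, take the first label with the `.str` accessor (which yields a missing value instead of raising), and map it through a dictionary. The sample rows, the `grade` column and the `borough_codes` name are hypothetical; the codes are the ones from the loop above.

```python
import pandas as pd

# Hypothetical sample standing in for the real `train` data
train = pd.DataFrame({"neighborhood": [
    "#Queens#Jackson Heights#",
    "Manhattan#Upper East Side#",      # no leading '#'
    "#Brooklyn#Williamsburg#",
    "#Bronx#East Bronx#Throgs Neck#",
    "",                                # empty value
    None,                              # missing value
]})

# Same codes as in the if/elif chain
borough_codes = {"Queens": "1", "Manhattan": "2", "Bronx": "3", "Brooklyn": "4"}

# Strip '#' from both ends, split, and take the first label
first_label = train["neighborhood"].str.strip("#").str.split("#").str[0]

# Look up the code; anything unknown, empty or missing falls back to '0'
train["grade"] = first_label.map(borough_codes).fillna("0")
print(train)
```

Stripping before splitting also means the first label ends up at position 0 whether or not the value starts with '#'.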
