Data extraction: pandas column operations

Time: 2020-03-27 09:25:11

Tags: python pandas

I have a DataFrame in this format:

Col1|Col2
A|Agriculture, forestry and fishing
1|Crop and animal production, hunting and related service activities
11|Growing of non-perennial crops
12|Growing of perennial crops
14|Animal production
C|Manufacturing
11|Manufacture of beverages
110|Manufacture of beverages
12|Manufacture of tobacco products
120|Manufacture of tobacco products 
14|Manufacture of wearing apparel 
141|Manufacture of wearing apparel, except fur apparel

A is an item, 1 under A is a sub_item, and 11 under A is a sub_sub_item. The problem arises under "C", where 11 is a sub_item.
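To make the ambiguity concrete, here is a minimal sketch that rebuilds the sample table and lists the codes that occur at more than one level. It assumes the codes in Col1 are stored as strings (which is what a column mixing letters and numbers typically yields); the names `rows` and `dup` are only illustrative.

```python
import pandas as pd

# Sample data copied from the table above; codes are assumed to be strings
rows = [
    ("A", "Agriculture, forestry and fishing"),
    ("1", "Crop and animal production, hunting and related service activities"),
    ("11", "Growing of non-perennial crops"),
    ("12", "Growing of perennial crops"),
    ("14", "Animal production"),
    ("C", "Manufacturing"),
    ("11", "Manufacture of beverages"),
    ("110", "Manufacture of beverages"),
    ("12", "Manufacture of tobacco products"),
    ("120", "Manufacture of tobacco products"),
    ("14", "Manufacture of wearing apparel"),
    ("141", "Manufacture of wearing apparel, except fur apparel"),
]
df = pd.DataFrame(rows, columns=["Col1", "Col2"])

# Codes that appear more than once: each is a sub_sub_item under A
# and a sub_item under C, so the code alone cannot decide the level
dup = df[df["Col1"].duplicated(keep=False)]
print(dup)
```

The same code value therefore needs its context (the enclosing letter and the preceding rows) to be classified.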

So far I have done the following:

import numpy as np

Col1_list = df['Col1'].values.tolist()
Col2_list = df['Col2'].values.tolist()

# Defining empty lists
item = []
sub_item = []
sub_sub = []
# Looping through the codes
for i in range(len(Col1_list)):
    code = str(Col1_list[i])
    if code.isalpha():
        item.append(Col2_list[i])
        sub_item.append(np.nan)
        sub_sub.append(np.nan)
    elif int(code) < 10 and len(code) == 1:
        item.append(np.nan)
        sub_item.append(Col2_list[i])
        sub_sub.append(np.nan)
    elif int(code) > 10 and len(code) == 2:
        # THIS IS WHERE IT FAILS, SINCE '11' is both a sub_item and a sub_sub
        pass

I would like to convert it to the following format:

Item|SubItem|Sub-SubItem
Agriculture, forestry and fishing|Crop and animal production, hunting and related service activities|Growing of non-perennial crops
Agriculture, forestry and fishing|Crop and animal production, hunting and related service activities|Growing of perennial crops
Agriculture, forestry and fishing|Crop and animal production, hunting and related service activities|Animal production
Manufacturing|Manufacture of beverages|Manufacture of beverages
Manufacturing|Manufacture of tobacco products|Manufacture of tobacco products 
Manufacturing|Manufacture of wearing apparel |Manufacture of wearing apparel, except fur apparel

3 answers:

Answer 0 (score: 0)

Use this approach:

import pandas as pd

data = [['tom', 10, 'M'], ['nick', 15, 'M'], ['juli', 14, 'F']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Gender'])

json_records = df.to_dict('records')

req_json = {}
male_list = []
female_list = []

for item in json_records:
    if item['Gender'] == 'M':
        male_list.append(item['Name'])
    if item['Gender'] == 'F':
        female_list.append(item['Name'])
# Build the result once, after the loop has finished
req_json['males'] = male_list
req_json['females'] = female_list
print(req_json)
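For comparison, the same grouping can probably be written without an explicit loop. The sketch below uses `groupby` on the toy data from this answer; the variable name `names_by_gender` is only illustrative, and the resulting keys are the raw Gender values rather than 'males' and 'females'.

```python
import pandas as pd

data = [['tom', 10, 'M'], ['nick', 15, 'M'], ['juli', 14, 'F']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Gender'])

# Collect the names of each gender into a list, keyed by the Gender value
names_by_gender = df.groupby('Gender')['Name'].apply(list).to_dict()
print(names_by_gender)
```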

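Answer 1 (score: 0)

I can't think of a good vectorized way, so I would loop over the Col1 data to find out whether each row is an Item, a SubItem or a SubSubItem. I would then use that to build the result dataframe: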

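import re

import numpy as np
import pandas as pd

# typ[i] is the level of row i: 0 = Item, 1 = SubItem, 2 = SubSubItem
typ = np.zeros(len(df), dtype=int)
for i, key in enumerate(df['Col1']):
    if re.match('[A-Z]+', key, re.I):
        # A letter code starts a new Item (typ stays 0)
        prev = key
    elif key.startswith(prev):
        # The code extends the previous SubItem code: a SubSubItem
        typ[i] = 2
    else:
        typ[i] = 1
        prev = key

resul = pd.DataFrame(index=df.index, columns=['Item', 'SubItem', 'SubSubItem'])

# Fill each column with the Col2 values of the matching level (aligned on the index)
for i, col in enumerate(resul.columns):
    resul[col] = df.loc[typ == i, 'Col2']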

This gives:

                                 Item                                            SubItem                                         SubSubItem
0   Agriculture, forestry and fishing                                                NaN                                                NaN
1                                 NaN  Crop and animal production, hunting and relate...                                                NaN
2                                 NaN                                                NaN                     Growing of non-perennial crops
3                                 NaN                                                NaN                         Growing of perennial crops
4                                 NaN                                                NaN                                  Animal production
5                       Manufacturing                                                NaN                                                NaN
6                                 NaN                           Manufacture of beverages                                                NaN
7                                 NaN                                                NaN                           Manufacture of beverages
8                                 NaN                    Manufacture of tobacco products                                                NaN
9                                 NaN                                                NaN                    Manufacture of tobacco products
10                                NaN                     Manufacture of wearing apparel                                                NaN
11                                NaN                                                NaN  Manufacture of wearing apparel, except fur app...

We then only need to forward-fill the NaN values and keep the relevant rows (the SubSubItem ones):

resul = resul.ffill()[typ == 2].reset_index(drop=True)

which gives:

                                Item                                            SubItem                                         SubSubItem
0  Agriculture, forestry and fishing  Crop and animal production, hunting and relate...                     Growing of non-perennial crops
1  Agriculture, forestry and fishing  Crop and animal production, hunting and relate...                         Growing of perennial crops
2  Agriculture, forestry and fishing  Crop and animal production, hunting and relate...                                  Animal production
3                      Manufacturing                           Manufacture of beverages                           Manufacture of beverages
4                      Manufacturing                    Manufacture of tobacco products                    Manufacture of tobacco products
5                      Manufacturing                     Manufacture of wearing apparel  Manufacture of wearing apparel, except fur app...
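The steps of this answer can be put together in one self-contained sketch. The function name `flatten_hierarchy` is hypothetical, and the sketch assumes that the codes are strings, that the first row is a letter-coded Item, and that every SubSubItem code starts with the code of its SubItem.

```python
import re

import numpy as np
import pandas as pd


def flatten_hierarchy(df):
    """Return one row per SubSubItem with its Item and SubItem (sketch)."""
    typ = np.zeros(len(df), dtype=int)  # 0 = Item, 1 = SubItem, 2 = SubSubItem
    prev = ""
    for i, key in enumerate(df["Col1"].astype(str)):
        if re.match("[A-Z]+", key, re.I):
            prev = key          # new Item; typ stays 0
        elif key.startswith(prev):
            typ[i] = 2          # extends the current SubItem code
        else:
            typ[i] = 1          # new SubItem
            prev = key
    cols = ["Item", "SubItem", "SubSubItem"]
    # One column per level, NaN wherever the row belongs to another level
    resul = pd.DataFrame(
        {col: df["Col2"].where(typ == level) for level, col in enumerate(cols)}
    )
    # Propagate Item/SubItem downwards, then keep only the SubSubItem rows
    return resul.ffill()[typ == 2].reset_index(drop=True)


sample = pd.DataFrame(
    [
        ("A", "Agriculture, forestry and fishing"),
        ("1", "Crop and animal production, hunting and related service activities"),
        ("11", "Growing of non-perennial crops"),
        ("C", "Manufacturing"),
        ("11", "Manufacture of beverages"),
        ("110", "Manufacture of beverages"),
    ],
    columns=["Col1", "Col2"],
)
result = flatten_hierarchy(sample)
print(result)
```

Because the level of a row depends on the previous SubItem code, the classification stays a loop; only the final fill and filter are vectorized.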

Answer 2 (score: 0)

It is a bit convoluted, but the code snippet below gets the job done.

import pandas as pd

# Fetch the positions of the Col1 rows that hold a string value (the Items)
string_indices = []
for idx, code in enumerate(df['Col1']):
    try:
        int(code)
    except ValueError:
        string_indices.append(idx)

# Length of every code, aligned with the row positions of df
code_lengths = [len(str(x)) for x in df['Col1']]

Rows = []  # initialised once, so rows from every Item are kept
for i, start in enumerate(string_indices):
    # An Item's block runs up to the next string row, or to the end of df
    end = string_indices[i + 1] if i + 1 < len(string_indices) else len(df)
    if start + 1 >= end:
        continue  # Item without any children
    first_length = code_lengths[start + 1]
    first_index = start + 1
    for pos in range(start + 1, end):
        if code_lengths[pos] > first_length:
            # Longer code than the block's first child: a Sub-SubItem
            Rows.append([df.iloc[start, 1], df.iloc[first_index, 1], df.iloc[pos, 1]])
        elif code_lengths[pos] == first_length:
            # Same length as the first child: a new SubItem
            first_index = pos

df_new = pd.DataFrame(data=Rows, columns=['Item', 'SubItem', 'Sub-SubItem'])

The output table is as follows:

                                Item                                            SubItem                                        Sub-SubItem
0  Agriculture, forestry and fishing  Crop and animal production, hunting and relate...                     Growing of non-perennial crops
1  Agriculture, forestry and fishing  Crop and animal production, hunting and relate...                         Growing of perennial crops
2  Agriculture, forestry and fishing  Crop and animal production, hunting and relate...                                  Animal production
3                      Manufacturing                           Manufacture of beverages                           Manufacture of beverages
4                      Manufacturing                    Manufacture of tobacco products                    Manufacture of tobacco products
5                      Manufacturing                     Manufacture of wearing apparel  Manufacture of wearing apparel, except fur app...
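The length-based idea of this answer (within each letter block, the first child's code length marks the SubItem level and longer codes are Sub-SubItems) can also be sketched with pandas operations instead of nested loops. This is only one possible formulation, not the answer's own code; it assumes string codes, and the names `block`, `first_len` and `flat` are illustrative.

```python
import pandas as pd

rows = [
    ("A", "Agriculture, forestry and fishing"),
    ("1", "Crop and animal production, hunting and related service activities"),
    ("11", "Growing of non-perennial crops"),
    ("12", "Growing of perennial crops"),
    ("14", "Animal production"),
    ("C", "Manufacturing"),
    ("11", "Manufacture of beverages"),
    ("110", "Manufacture of beverages"),
    ("12", "Manufacture of tobacco products"),
    ("120", "Manufacture of tobacco products"),
    ("14", "Manufacture of wearing apparel"),
    ("141", "Manufacture of wearing apparel, except fur apparel"),
]
df = pd.DataFrame(rows, columns=["Col1", "Col2"])

codes = df["Col1"].astype(str)
is_item = codes.str.isalpha()      # letter rows are the Items
block = is_item.cumsum()           # block id: increases at every Item row
length = codes.str.len()

# Code length of the first child in each block (Item rows are masked out)
first_len = length.where(~is_item).groupby(block).transform("first")
is_sub = ~is_item & (length == first_len)
is_subsub = ~is_item & (length > first_len)

flat = pd.DataFrame({
    "Item": df["Col2"].where(is_item).ffill(),
    "SubItem": df["Col2"].where(is_sub).ffill(),
    "Sub-SubItem": df["Col2"].where(is_subsub),
})
# Keep only the Sub-SubItem rows, which now carry their parents
flat = flat[is_subsub].reset_index(drop=True)
print(flat)
```

Like the loop version, this relies on every block listing a SubItem before its Sub-SubItems.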