Find specific strings in dataframe cells and flag them

Time: 2019-05-28 04:52:24

Tags: pandas

Hi, I want to iterate over the rows of a dataframe column, find specific strings in each cell, and flag them when they are found.

Dataframe (there are thousands more rows)

ID  PRICE   TEXT                                             AFRICA    ASIA    EUROPE  BRITAIN / IRELAND / SCOTLAND NOTHING  FROM TO STAIRES_COUNT  MAX_CLEARANCE_M    3 seater couch    4 seater+ couch   < 60 inches TV   60 inches+ TV   Washer - front loader   Box / bag / misc
AAA 1000    travel to africa x 2\n\nAirport pick up included. 
            from A to B. there are 2 flights of staires.
            Furnitures 3 seater couch x 1
            4 seater+ couch x 1
            < 60 inches TV x 1
            60 inches+ TV x 1
            Washer - front loader x 1
            Box / bag / misc x 1
            The maximum clearance is 1.5m.
BBB 1000    trip to asia x 2\n\nWorthwhile experience.        
            from C to D
            The maximum clearance is 1.5m.  
CCC 1000    holiday in europe x2\n\nLocal experience.         
            from C to D     
DDD 1000   holiday in Fiji x2\n\nLocal experience.            
            from A to D. there are 2 flights of staires.

continents = ['africa', 'asia', 'europe', '3 seater couch','4 seater+ couch','<60 inches TV','60 inches+ TV','Washer - front loader','Box / bag / misc'] (the list is long; this is only part of it).

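I want to look up each string from the continents list in the TEXT column and record how many people are going to that continent.
The other strings I want to find and flag are: from '...' to '...'; There are '...' flights of stairs.; The maximum clearance is '...'m. Each '...' marks the position of the value I want to extract.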

Desired output

    ID  PRICE   TEXT                                                                        AFRICA    ASIA    EUROPE  BRITAIN / IRELAND / SCOTLAND NOTHING  FROM TO STAIRES_COUNT  MAX_CLEARANCE_M  3 seater couch   4 seater+ couch   < 60 inches TV  60 inches+ TV   Washer - front loader   Box / bag / misc

    AAA 1000    travel to africa x 2\nasia x 2\neurope x 2\n\nAirport pick up included.     2         2        2        0                           0         A    B   2                1.5.           1                   1              1                 1                       1             1   
                from A to B. there are 2 flights of staires. Furnitures 3 seater couch x 1
                4 seater+ couch x 1
                < 60 inches TV x 1
                60 inches+ TV x 1
                Washer - front loader x 1
                Box / bag / misc x 1
                The maximum clearance is 1.5m.
    BBB 1000    trip to asia x 2\nBRITAIN / IRELAND / SCOTLAND x 2 \n\nWorthwhile experience.0         2       0        2                           0         C    D   0              0
                from C to D
                The maximum clearance is 1.5m.  
    CCC 1000    holiday in europe x2\n\nLocal experience.                                   0           0       2       0                           0         C    D   0              0
                from C to D     
    DDD 1000   holiday in Fiji x2\n\nLocal experience.                                      0           0       0       0                           2         A    D   2              0
                from A to D. there are 2 flights of stairs.

I am thinking of using two for loops to iterate over the rows:

for i in df.iterrows():
    for j in continents:
        if j in i:

But I don't know how to flag the matches in the dataframe.
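For reference, a minimal sketch of how the two-loop idea could record the counts, assuming each keyword is followed by "x <number>" as in the sample. The two-row frame and the shortened keyword list are made up for illustration, not the real data. Row-wise loops get slow on thousands of rows, so this is only a starting point.

```python
import re
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({
    "ID": ["AAA", "BBB"],
    "TEXT": ["travel to africa x 2\nasia x 3", "trip to asia x 2"],
})
continents = ["africa", "asia", "europe"]

# Create one integer column per keyword, initialised to 0.
for name in continents:
    df[name.upper()] = 0

# Outer loop over rows, inner loop over keywords.
for idx, row in df.iterrows():
    for name in continents:
        # Look for "<keyword> x <n>" inside the TEXT cell of this row.
        m = re.search(re.escape(name) + r"\s+x\s*(\d+)", row["TEXT"])
        if m:
            # Write the count back through df.loc; `row` is only a copy.
            df.loc[idx, name.upper()] = int(m.group(1))

print(df[["ID", "AFRICA", "ASIA", "EUROPE"]])
```

Note that `iterrows()` yields `(index, row)` tuples, so the row has to be unpacked before its TEXT value can be searched.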

2 answers:

Answer 0 (score: 2)

Solution:

import re

continents = ['BRITAIN / IRELAND / SCOTLAND', 'africa', 'asia', 'europe', 
              '3 seater couch','4 seater+ couch','60 inches TV','60 inches+ TV',
              'Washer - front loader','Box / bag / misc']
#print (df)

#https://stackoverflow.com/a/56355175/2901002
df1 = df['TEXT'].str.extractall(r'(.*?)\s+x\s*(\d+)')

#get values from continents list
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in continents)
df1[0] = df1[0].str.extract('('+ pat + ')', expand=False).fillna('NOTHING').str.upper()
df1[1] = df1[1].astype(int)

#sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = df1.set_index(0, append=True)[1].unstack(fill_value=0).groupby(level=0).sum()
print (df1)
0  3 SEATER COUCH  4 SEATER+ COUCH  60 INCHES TV  60 INCHES+ TV  AFRICA  ASIA  \
0               1                1             1              1       2     2   
1               0                0             0              0       0     2   
2               0                0             0              0       0     0   
3               0                0             0              0       0     0   

0  BOX / BAG / MISC  BRITAIN / IRELAND / SCOTLAND  EUROPE  NOTHING  \
0                 1                             0       2        0   
1                 0                             2       0        0   
2                 0                             0       2        0   
3                 0                             0       0        2   

0  WASHER - FRONT LOADER  
0                      1  
1                      0  
2                      0  
3                      0  
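The list in this answer uses '60 inches TV' where the question has '<60 inches TV'. One likely reason (an assumption, the answer does not say so) is that `\b` cannot match next to a non-word character such as `<`. The sketch below shows that behaviour and one possible workaround with lookarounds; the label and text are made up for illustration.

```python
import re

label = "< 60 inches TV"               # keyword that starts with a non-word character
text = "Furnitures < 60 inches TV x 1"

# \b needs a word character on exactly one side, so it cannot match between " " and "<".
with_b = re.search(r"\b{}\b".format(re.escape(label)), text)

# Lookarounds only require that no word character touches the keyword.
with_look = re.search(r"(?<!\w){}(?!\w)".format(re.escape(label)), text)

print(with_b)               # None
print(with_look.group(0))   # < 60 inches TV
```

If keywords with leading or trailing symbols need to be kept verbatim, the lookaround form could replace `\b...\b` when building `pat`.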

#add to original dataframe
df = df[['ID','PRICE','TEXT']].join(df1)

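#extract the other values (raw strings avoid invalid-escape warnings)
df[['FROM','TO']] = df['TEXT'].str.extract(r'from ([A-Za-z]+)\s+to\s+([A-Za-z]+)')

#expand=False returns a Series, which can be assigned to a single column
df['STAIRES_COUNT'] = (df['TEXT'].str.extract(r'(\d+)\s+flights of staires', expand=False)
                                .fillna('0').astype(int))
#\b after m keeps the match from depending on whatever character follows the unit
df['MAX_CLEARANCE_M'] = (df['TEXT'].str.extract(r'(\d+\.\d+|\d+)m\b', expand=False)
                                  .fillna('0').astype(float))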

print (df)
    ID  PRICE                                               TEXT  \
0  AAA   1000  travel to africa x 2\ asia x 2\ europe x 2\ Ai...   
1  BBB   1000  trip to asia x 2\ BRITAIN / IRELAND / SCOTLAND...   
2  CCC   1000  holiday in europe x2\Local experience.   from ...   
3  DDD   1000  holiday in Fiji x2\ Local experience. from A t...   

   3 SEATER COUCH  4 SEATER+ COUCH  60 INCHES TV  60 INCHES+ TV  AFRICA  ASIA  \
0               1                1             1              1       2     2   
1               0                0             0              0       0     2   
2               0                0             0              0       0     0   
3               0                0             0              0       0     0   

   BOX / BAG / MISC  BRITAIN / IRELAND / SCOTLAND  EUROPE  NOTHING  \
0                 1                             0       2        0   
1                 0                             2       0        0   
2                 0                             0       2        0   
3                 0                             0       0        2   

   WASHER - FRONT LOADER FROM TO  STAIRES_COUNT  MAX_CLEARANCE_M  
0                      1    A  B              2              1.5  
1                      0    C  D              0              1.5  
2                      0    C  D              0              0.0  
3                      0    A  D              2              0.0  
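One thing to watch with this approach: `extractall` returns nothing for a row whose TEXT contains no "x <number>" pair, so that row is missing from the count table and gets NaN after `join`. A minimal sketch of one way to handle it, assuming zeros are the desired default; the two-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the second row has no "<label> x <n>" pair at all.
df = pd.DataFrame({
    "ID": ["AAA", "EEE"],
    "TEXT": ["trip to asia x 2", "no quantities here"],
})

# Same steps as the answer, reduced to a single keyword.
pairs = df["TEXT"].str.extractall(r"(.*?)\s+x\s*(\d+)")
pairs[0] = pairs[0].str.extract(r"\b(asia)\b", expand=False).fillna("NOTHING").str.upper()
pairs[1] = pairs[1].astype(int)
counts = pairs.set_index(0, append=True)[1].unstack(fill_value=0).groupby(level=0).sum()

# Row 1 is absent from counts; reindex restores it with zeros and keeps integer dtype.
counts = counts.reindex(df.index, fill_value=0)

out = df.join(counts)
print(out)
```

Without the `reindex` call, the ASIA column would hold NaN for the second row and become float.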

Answer 1 (score: 0)

Data

    ID    PRICE                                               TEXT  AFRICA  \
0  AAA     1000  travel to africa x 2 asia x 5 Airport pick up ...       0   
1  BBB     1000          trip to asia x 1 Worthwhile experience.         0   
2  CCC     1000             holiday in europe x2 Local experience.       0   

   ASIA  EUROPE  NOTHING  
0     0       0        0  
1     0       0        0  
2     0       0        0  

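import re

cols = df.columns[3:6]  #AFRICA, ASIA, EUROPE; lower-cased names are the search keys
data = df.TEXT.apply(lambda x: [re.findall(re.escape(a) + r'\s+x\s*(\d+)', x) for a in cols.str.lower()])

for i in df.index:
    #first captured count per column, or 0 when the keyword is absent
    df.loc[i, cols] = [int(j[0]) if j else 0 for j in data.loc[i]]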

Output

    ID    PRICE                                               TEXT AFRICA  \
0  AAA     1000  travel to africa x 2 asia x 5 Airport pick up ...      2   
1  BBB     1000          trip to asia x 1 Worthwhile experience.        0   
2  CCC     1000             holiday in europe x2 Local experience.      0   

  ASIA EUROPE  NOTHING  
0    5      0        0  
1    1      0        0  
2    0      2        0
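The same result can probably be reached without the row loop. The following is a sketch, not part of the original answer: it calls `str.extract` once per target column and assumes each keyword appears at most once per cell. The frame below is rebuilt from the sample shown above, with the truncated first TEXT shortened by hand.

```python
import re
import pandas as pd

# Sample rebuilt from the data above (first TEXT value shortened for illustration).
df = pd.DataFrame({
    "ID": ["AAA", "BBB", "CCC"],
    "PRICE": [1000, 1000, 1000],
    "TEXT": [
        "travel to africa x 2 asia x 5 Airport pick up",
        "trip to asia x 1 Worthwhile experience.",
        "holiday in europe x2 Local experience.",
    ],
})

for col in ["AFRICA", "ASIA", "EUROPE"]:
    # "<keyword> x <n>", with optional whitespace between x and the number.
    pattern = re.escape(col.lower()) + r"\s+x\s*(\d+)"
    # First match per cell; cells without the keyword become 0.
    df[col] = df["TEXT"].str.extract(pattern, expand=False).fillna("0").astype(int)

print(df[["ID", "AFRICA", "ASIA", "EUROPE"]])
```

This keeps the counts as integers rather than strings, which makes later arithmetic on the columns simpler.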