How to split a DataFrame column's values into multiple columns

Time: 2021-07-08 05:48:33

Tags: python pandas dataframe split

Below is my DataFrame, in which several values are merged into a single column:

   PLUGS\nDESIGN\nGEAR
0  700\nDaewoo 8000  Gearless   
1  300\nHyundai 4400  Gearless   
2  600\nSTX 2600  Gearless   
3  200\nB170 \nGeared   
4  362 Wenchong 1700 Mk II \nGeared   
5  252\nRichMax 1550  Gearless   
6  220\nCV 1100 Plus \nGeared   
7  232\nOrskov Mk VII  Gearless   
8  119\nKouan 1000  Gearless   
9  100\nHanjin 700  Gearless

I want to split this column into three separate columns: PLUGS, DESIGN and GEAR. Is there a way to do this?

Here is the code I tried:

new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)

Expected output:

   PLUGS  DESIGN               GEAR
0  700    Daewoo 8000          Gearless   
1  300    Hyundai 4400         Gearless   
2  600    STX 2600             Gearless   
3  200    B170                 Geared   
4  362    Wenchong 1700 Mk II  Geared   
5  252    RichMax 1550         Gearless   
6  220    CV 1100 Plus         Geared   
7  232    Orskov Mk VII        Gearless   
8  119    Kouan 1000           Gearless   
9  100    Hanjin 700           Gearless

2 answers:

Answer 0 (score: 2)

As suggested in the comments, a regular expression should work well here.

Sample DataFrame:

>>> df
                   PLUGS\nDESIGN\nGEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9         100\nHanjin 700  Gearless

First, remove the newline marker from the column name, which makes it both more readable and easier to work with.

>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)

Now the column name no longer contains any special characters:

>>> df
                     PLUGS DESIGN GEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9         100\nHanjin 700  Gearless
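
One caveat worth checking: the replace call above targets a literal backslash followed by n (two characters), which is what this answer's sample data appears to contain. If the column name holds a real newline character instead, as it would when built from a Python string such as 'PLUGS\nDESIGN\nGEAR', that pattern matches nothing. Below is a minimal sketch that handles both cases; the two sample frames and the combined pattern are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical frames: one uses real newline characters,
# the other a literal backslash followed by the letter n.
df_real = pd.DataFrame({"PLUGS\nDESIGN\nGEAR": ["700\nDaewoo 8000  Gearless"]})
df_literal = pd.DataFrame({r"PLUGS\nDESIGN\nGEAR": [r"700\nDaewoo 8000  Gearless"]})

# In a regex, \\n matches backslash + n, while \n matches a real newline.
pattern = r"\\n|\n"

for frame in (df_real, df_literal):
    frame.columns = frame.columns.str.replace(pattern, " ", regex=True)
    print(list(frame.columns))
```

Both frames end up with the same single column name, so the later steps can refer to 'PLUGS DESIGN GEAR' either way.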

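Now we can use pandas.Series.str.extract. With this method, every capture group () in the regex becomes a column in the result.

Because unnamed groups become columns with default integer labels (0, 1, 2), we can rename them to the names we want and get the desired result, as follows: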

>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})

Result:

>>> print(df)
  PLUGS                DESIGN      GEAR
0   700          Daewoo 8000   Gearless
1   300         Hyundai 4400   Gearless
2   600             STX 2600   Gearless
3   200                 B170     Geared
4   362  Wenchong 1700 Mk II     Geared
5   252         RichMax 1550   Gearless
6   220         CV 1100 Plus     Geared
7   232        Orskov Mk VII   Gearless
8   119           Kouan 1000   Gearless
9   100           Hanjin 700   Gearless
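
As a variation on the approach above, str.extract also accepts named groups, in which case the group names are used as column names and the rename step is not needed. The sketch below is not the pattern from this answer: it is a simplified one written for illustration, and it assumes the cells contain real newline characters and that the GEAR value always starts with "Gear".

```python
import pandas as pd

# Assumption: the cells contain real newline characters.
s = pd.Series([
    "700\nDaewoo 8000  Gearless",
    "200\nB170 \nGeared",
    "362 Wenchong 1700 Mk II \nGeared",
])

# Named groups (?P<name>...) become the column names of the result.
# DESIGN is matched lazily so surrounding whitespace is left out of it.
pattern = r"^(?P<PLUGS>\d+)\s+(?P<DESIGN>.+?)\s+(?P<GEAR>Gear\w+)$"

out = s.str.extract(pattern)
print(out)
```

Because the whitespace runs are consumed by \s+ outside the groups, the DESIGN values come out without leading or trailing spaces.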

Regex explanation:

You can check it on regex101.com:
(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)

First capture group (\d+)

    \d matches a digit (equivalent to [0-9])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Second capture group ([^\\]+)

    Match a single character not present in the list below [^\\]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Third capture group ([\|^Gear][a-z]+)

    Match a single character present in the list below [\|^Gear]
    \| matches the character | literally (case sensitive)
    ^Gear matches a single character in the list ^Gear (case sensitive)
    Match a single character present in the list below [a-z]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)

Global pattern flags

    g modifier: global. All matches (don't return after first match)
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
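
To see what each group captures without going through pandas, the pattern can be tried on single strings with Python's re module. This is a small sketch that uses the pattern exactly as written in the str.extract call above and assumes, as this answer's data seems to, that the strings contain a literal backslash followed by n rather than a real newline. The two sample strings are copied from the question.

```python
import re

# Pattern copied from the str.extract call in this answer.
pattern = re.compile(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)")

# Raw strings: \n here is a literal backslash followed by n, not a newline.
samples = [
    r"700\nDaewoo 8000  Gearless",
    r"200\nB170 \nGeared",
]

groups = [pattern.match(text).groups() for text in samples]
for text, captured in zip(samples, groups):
    print(repr(text), "->", captured)
```

Note that the second group keeps one trailing space ('Daewoo 8000 ' and 'B170 '), so calling .str.strip() on the DESIGN column afterwards may be worthwhile.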

Answer 1 (score: 1)

Starting from your DataFrame:

>>> import pandas as pd

>>> df = pd.DataFrame({'PLUGS\nDESIGN\nGEAR': ['700\nDaewoo 8000  Gearless', '300\nHyundai 4400  Gearless', '600\nSTX 2600  Gearless', '200\nB170 \nGeared', '362 Wenchong 1700 Mk II \nGeared', '252\nRichMax 1550  Gearless'], }, 
...                   index = [0, 1, 2, 3, 4, 5]) 
>>> df
    PLUGS\nDESIGN\nGEAR
0   700\nDaewoo 8000 Gearless
1   300\nHyundai 4400 Gearless
2   600\nSTX 2600 Gearless
3   200\nB170 \nGeared
4   362 Wenchong 1700 Mk II \nGeared
5   252\nRichMax 1550 Gearless

You can indeed use the split method with multiple delimiters, here \n and space:

>>> df = pd.DataFrame(df['PLUGS\nDESIGN\nGEAR'].str.split('\n| '))
>>> df
    PLUGS\nDESIGN\nGEAR
0   [700, Daewoo, 8000, , Gearless]
1   [300, Hyundai, 4400, , Gearless]
2   [600, STX, 2600, , Gearless]
3   [200, B170, , Geared]
4   [362, Wenchong, 1700, Mk, II, , Geared]
5   [252, RichMax, 1550, , Gearless]

Then you can assign the first and last elements to the matching columns, and everything in between to the DESIGN column:

>>> df['PLUGS'] = df['PLUGS\nDESIGN\nGEAR'].str[0]
>>> df['DESIGN'] = df['PLUGS\nDESIGN\nGEAR'].str[1:-1]
>>> df['GEAR'] = df['PLUGS\nDESIGN\nGEAR'].str[-1]
>>> df
    PLUGS\nDESIGN\nGEAR                         PLUGS   DESIGN                      GEAR
0   [700, Daewoo, 8000, , Gearless]             700     [Daewoo, 8000, ]            Gearless
1   [300, Hyundai, 4400, , Gearless]            300     [Hyundai, 4400, ]           Gearless
2   [600, STX, 2600, , Gearless]                600     [STX, 2600, ]               Gearless
3   [200, B170, , Geared]                       200     [B170, ]                    Geared
4   [362, Wenchong, 1700, Mk, II, , Geared]     362     [Wenchong, 1700, Mk, II, ]  Geared
5   [252, RichMax, 1550, , Gearless]            252     [RichMax, 1550, ]           Gearless

The last step is to clean up the DESIGN column by using the join method to turn each list into a string, and then to drop the PLUGS\nDESIGN\nGEAR column, as follows:

>>> df['DESIGN'] = df['DESIGN'].apply(lambda x: ' '.join(map(str, x)))
>>> df.drop(['PLUGS\nDESIGN\nGEAR'], axis=1)
    PLUGS   DESIGN               GEAR
0   700     Daewoo 8000          Gearless
1   300     Hyundai 4400         Gearless
2   600     STX 2600             Gearless
3   200     B170                 Geared
4   362     Wenchong 1700 Mk II  Geared
5   252     RichMax 1550         Gearless
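
A possible shortcut for the same idea, sketched under the assumption that PLUGS and GEAR are each a single token with no whitespace inside: split once from the left and once from the right on any whitespace, so that DESIGN is whatever remains in the middle. The variable names head, tail and result are made up for this illustration.

```python
import pandas as pd

col = "PLUGS\nDESIGN\nGEAR"
df = pd.DataFrame({col: [
    "700\nDaewoo 8000  Gearless",
    "200\nB170 \nGeared",
    "362 Wenchong 1700 Mk II \nGeared",
]})

# Split once from the left on any whitespace run: PLUGS | remainder.
head = df[col].str.split(n=1, expand=True)

# Split the remainder once from the right: DESIGN | GEAR.
tail = head[1].str.rsplit(n=1, expand=True)

result = pd.DataFrame({"PLUGS": head[0], "DESIGN": tail[0], "GEAR": tail[1]})
print(result)
```

With no pattern given, split and rsplit treat any run of whitespace (spaces or newlines) as one separator, so no empty strings appear and DESIGN needs no join step.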