Below is my DataFrame, in which three fields are merged into a single column:
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
I want to split this column into three separate columns: PLUGS, DESIGN and GEAR. Is there a way to do this?
Here is the code I tried:
new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)
Expected output:
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
6 220 CV 1100 Plus Geared
7 232 Orskov Mk VII Gearless
8 119 Kouan 1000 Gearless
9 100 Hanjin 700 Gearless
Answer 0 (score: 2)
As suggested in the comments, a regular expression should work well here:
>>> df
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
First, remove the newline markers from the column name so that it is easier to read and to use.
>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)
Now the column name no longer contains any special characters:
>>> df
PLUGS DESIGN GEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
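It is not clear from the printed frame whether the header holds a real newline character or a literal backslash followed by n. As a minimal sketch, assuming a one-row frame built by hand, the alternation below covers both cases when cleaning the column name:

```python
import pandas as pd

# Assumption: a hand-built, one-row frame whose header contains real newlines.
df = pd.DataFrame({"PLUGS\nDESIGN\nGEAR": ["700\nDaewoo 8000 Gearless"]})

# r"\\n" matches a literal backslash followed by n; r"\n" matches a newline.
df.columns = df.columns.str.replace(r"\\n|\n", " ", regex=True)

print(df.columns.tolist())
```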
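Now we can use pandas.Series.str.extract. With this method, each capture group () in the regex becomes a column in the result.
Because the groups here are unnamed, the columns get default integer labels (0, 1, 2), so we rename them to the names we want to get the desired result, as follows: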
>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})
>>> print(df)
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
6 220 CV 1100 Plus Geared
7 232 Orskov Mk VII Gearless
8 119 Kouan 1000 Gearless
9 100 Hanjin 700 Gearless
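A possible variant of the step above: str.extract takes column names directly from named groups, so the rename can be skipped. This is only a sketch. It assumes the cells hold real newline characters, and the pattern and the three sample rows are illustrative rather than the exact code used above.

```python
import pandas as pd

# Assumption: the cells contain real newline characters.
raw = pd.Series([
    "700\nDaewoo 8000 Gearless",
    "200\nB170 \nGeared",
    "362 Wenchong 1700 Mk II \nGeared",
])

# Named groups (?P<name>...) become the column names of the result.
# DESIGN is matched lazily so that the whitespace before GEAR is left out.
pattern = r"^(?P<PLUGS>\d+)\s+(?P<DESIGN>.+?)\s+(?P<GEAR>Gear\w+)$"
out = raw.str.extract(pattern)

print(out)
```

Because `\s` covers spaces and newlines alike, the same pattern handles rows where the separator is a space, a newline, or both.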
Regex explanation:
You can test the pattern (\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+) on regex101.com.
1st Capturing Group (\d+)
\d matches a digit (equivalent to [0-9])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [\\n\s]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
n matches the character n literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
2nd Capturing Group ([^\\]+)
Match a single character not present in the list below [^\\]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
Match a single character present in the list below [\\n\s]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
n matches the character n literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
3rd Capturing Group ([\|^Gear][a-z]+)
Match a single character present in the list below [\|^Gear]
\| matches the character | literally (case sensitive)
^Gear matches a single character in the list ^Gear (case sensitive)
Match a single character present in the list below [a-z]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
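As the breakdown notes, inside a character class `\\n` means "a backslash or the letter n", not a newline; real newlines are matched by `\s` instead. The sketch below uses Python's standard `re` module to try the pattern from the answer on one row written both ways. The two sample strings are assumptions for illustration.

```python
import re

# The pattern from the answer, anchored at the start of the string.
PATTERN = re.compile(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)")

# Assumption: two possible representations of the same row.
literal = r"200\nB170 \nGeared"  # a backslash followed by the letter n
real = "200\nB170 \nGeared"      # actual newline characters

results = []
for text in (literal, real):
    groups = PATTERN.match(text).groups()
    # The DESIGN group can keep a trailing space, so strip every part.
    results.append([g.strip() for g in groups])

print(results)
```

With either representation the DESIGN group is captured with a trailing space, which is why the parts are stripped afterwards.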
Answer 1 (score: 1)
Starting from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'PLUGS\nDESIGN\nGEAR': ['700\nDaewoo 8000 Gearless', '300\nHyundai 4400 Gearless', '600\nSTX 2600 Gearless', '200\nB170 \nGeared', '362 Wenchong 1700 Mk II \nGeared', '252\nRichMax 1550 Gearless'], },
... index = [0, 1, 2, 3, 4, 5])
>>> df
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
You can indeed use the split method with more than one delimiter, here \n and a space:
>>> df = pd.DataFrame(df['PLUGS\nDESIGN\nGEAR'].str.split('\n| '))
>>> df
PLUGS\nDESIGN\nGEAR
0 [700, Daewoo, 8000, , Gearless]
1 [300, Hyundai, 4400, , Gearless]
2 [600, STX, 2600, , Gearless]
3 [200, B170, , Geared]
4 [362, Wenchong, 1700, Mk, II, , Geared]
5 [252, RichMax, 1550, , Gearless]
Then you can assign the first and last elements to the matching columns, and everything in between to the DESIGN column:
>>> df['PLUGS'] = df['PLUGS\nDESIGN\nGEAR'].str[0]
>>> df['DESIGN'] = df['PLUGS\nDESIGN\nGEAR'].str[1:-1]
>>> df['GEAR'] = df['PLUGS\nDESIGN\nGEAR'].str[-1]
>>> df
PLUGS\nDESIGN\nGEAR PLUGS DESIGN GEAR
0 [700, Daewoo, 8000, , Gearless] 700 [Daewoo, 8000, ] Gearless
1 [300, Hyundai, 4400, , Gearless] 300 [Hyundai, 4400, ] Gearless
2 [600, STX, 2600, , Gearless] 600 [STX, 2600, ] Gearless
3 [200, B170, , Geared] 200 [B170, ] Geared
4 [362, Wenchong, 1700, Mk, II, , Geared] 362 [Wenchong, 1700, Mk, II, ] Geared
5 [252, RichMax, 1550, , Gearless] 252 [RichMax, 1550, ] Gearless
The last step is to turn the DESIGN column from a list into a string with the join method, and to drop the PLUGS\nDESIGN\nGEAR column, as follows:
>>> df['DESIGN'] = df['DESIGN'].apply(lambda x: ' '.join(map(str, x)))
>>> df.drop(['PLUGS\nDESIGN\nGEAR'], axis=1)
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
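The same idea can be written more compactly. The sketch below is one possible variant, not the code above: it assumes real newline characters in the cells and splits on runs of whitespace, so a sequence such as " \n" does not leave empty strings in the lists.

```python
import pandas as pd

# Assumption: a shortened sample with real newline characters.
df = pd.DataFrame({"PLUGS\nDESIGN\nGEAR": [
    "700\nDaewoo 8000 Gearless",
    "200\nB170 \nGeared",
    "362 Wenchong 1700 Mk II \nGeared",
]})

# A multi-character pattern is treated as a regex by str.split,
# so r"\s+" splits on any run of spaces and newlines.
parts = df["PLUGS\nDESIGN\nGEAR"].str.strip().str.split(r"\s+")

result = pd.DataFrame({
    "PLUGS": parts.str[0],                    # first token
    "DESIGN": parts.str[1:-1].str.join(" "),  # everything in between
    "GEAR": parts.str[-1],                    # last token
})

print(result)
```

Building a new frame avoids having to drop the original combined column afterwards.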