Multiple loops and multiple splits on a pandas DataFrame

Time: 2020-10-10 07:19:30

Tags: python pandas csv data-science data-cleaning

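I have a CSV file containing 22,000 rows of author names.

  1. Each row holds multiple author names, separated by ';'.
  2. Each author name in a row is written in "last name, first name" order.

I want to split them and add the parts as new columns, as shown below.

Preview of the original dataset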

+------------------------------------+
|           author_full_name         |
+------------------------------------+
| Kahana, M J; Adler, M              |
|Gautam, H; Potdar, G G; Vidya, T N C|
+------------------------------------+

Expected output

+------------------------------------+-------------------+----------------------+
|           author_full_name         | author_first_names| author_last_names    |
+------------------------------------+-------------------+----------------------+
| Kahana, M J; Adler, M              |      M J; M       | Kahana; Adler        |
|Gautam, H; Potdar, G G; Vidya, T N C|     H; G G; T N C | Gautam; Potdar; Vidya|
+------------------------------------+-------------------+----------------------+

How can I do this with pandas?

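1 answer:

Answer 0: (score: 1)

The logic is to split each row on ';' first, then split each resulting entry on ','. The first value of each pair is the last name and the second value is the first name. For example, to pull the last names out of one row: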

>>> [x.split(",")[0] for x in "Gautam, H; Potdar, G G; Vidya, T N C".split(";")]
['Gautam', ' Potdar', ' Vidya']
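
The same idea can be wrapped in a small helper that returns both parts and strips the leading spaces visible in the result above. This is only a sketch: the name split_authors is made up for illustration, and it assumes each entry has the form "last, first" (an entry without a comma ends up with an empty first name).

```python
def split_authors(full_names):
    """Split 'Last, First; Last, First' into (first_names, last_names)."""
    first_names, last_names = [], []
    for author in full_names.split(";"):
        # partition splits on the first comma only
        last, _, first = author.partition(",")
        last_names.append(last.strip())
        first_names.append(first.strip())
    return "; ".join(first_names), "; ".join(last_names)


first, last = split_authors("Gautam, H; Potdar, G G; Vidya, T N C")
print(first)  # H; G G; T N C
print(last)   # Gautam; Potdar; Vidya
```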

Using apply in pandas:

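import pandas as pd

df = pd.DataFrame({"Name": ["Gautam, H; Potdar, G G; Vidya, T N C", "Kahana, M J; Adler, M "]})
# After splitting on ",", index 1 is the first name and index 0 is the last name.
# strip() removes the spaces left around each piece by the ";" and "," splits.
df['author_first_names'] = df['Name'].apply(lambda x: "; ".join(ele.split(",")[1].strip() for ele in x.split(";")))
df['author_last_names'] = df['Name'].apply(lambda x: "; ".join(ele.split(",")[0].strip() for ele in x.split(";")))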

df

Output:

------------------------------------|-----------------|------------------------
Gautam, H; Potdar, G G; Vidya, T N C  H; G G; T N C      Gautam; Potdar; Vidya
Kahana, M J; Adler, M                 M J; M             Kahana; Adler
------------------------------------|-----------------|------------------------
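
For a file with 22,000 rows, a variant built on pandas string methods instead of a per-row lambda may be easier to extend. The sketch below is one possible alternative, not the method used in the answer: it combines Series.str.split, explode, and a groupby on the original row index. It assumes every author entry contains a comma and that the column has no missing values.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Gautam, H; Potdar, G G; Vidya, T N C",
                            "Kahana, M J; Adler, M "]})

# One row per author; the original row index is repeated for each author
authors = df["Name"].str.split(";").explode().str.strip()

# Split each "Last, First" once on the comma into two columns
parts = authors.str.split(",", n=1, expand=True)
parts.columns = ["last", "first"]
parts = parts.apply(lambda col: col.str.strip())

# Join the pieces back together per original row
grouped = parts.groupby(level=0)
df["author_first_names"] = grouped["first"].agg("; ".join)
df["author_last_names"] = grouped["last"].agg("; ".join)
print(df)
```

Exploding to one row per author also makes it straightforward to inspect or clean individual names before joining them back.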