Question

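I have a dataset for a gun violence project. One of the columns holds the participant type, either Victim or Subject-Suspect. Because it has an entry for every participant in an incident, a single cell in that column contains multiple values.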

import pandas as pd
data = pd.read_csv('Gun violence Shortened version.csv')
data.head()

Output:

 incident_id    date    state   participant_type    
0   461105  1/1/2013    Pennsylvania    0::Victim||1::Victim||2::Victim||3::Victim||4:...   
1   460726  1/1/2013    California  0::Victim||1::Victim||2::Victim||3::Victim||4:...   
2   478855  1/1/2013    Ohio    0::Subject-Suspect||1::Subject-Suspect||2::Vic...   
3   478925  1/5/2013    Colorado    0::Victim||1::Victim||2::Victim||3::Subject-Su...   
4   478959  1/7/2013    North Carolina  0::Victim||1::Victim||2::Victim||3::Subject-Su...

I want to take each participant and give them their own row, while keeping the incident_id and date the same:

incident_id date    state   participant_type    
0   461105  1/1/2013    Pennsylvania    Victim
1   461105  1/1/2013    Pennsylvania    Victim
2   461105  1/1/2013    Pennsylvania    Victim
3   461105  1/1/2013    Pennsylvania    Subject-Suspect *this was the 4:: instance that was cut off earlier*

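I'm not sure how to accomplish this. I have seen examples of splitting one column into two, but none of splitting the values in one column into separate rows.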

Answer 1

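I would rather prepare the data beforehand with regular Python data structures and then create the pandas DataFrame from the result. The reason is that pandas is not primarily designed for row-by-row operations. There are many ways to do them, but doing so is considered an anti-pattern and is much slower.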

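The following code uses the csv module from the Python standard library to parse the CSV data into a regular list, adding multiple rows for every CSV row whose last column contains multiple items. In the last step, the pandas DataFrame is simply created from the preprocessed list: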

import pandas as pd
import csv

data = []
with open('Gun violence Shortened version.csv', newline='') as file:
    reader = csv.reader(file, delimiter=',')

    # iterate over all rows in the CSV
    for row in reader:
        # split the content of the last column by the || delimiter into a list
        # if there's no delimiter, it will produce a single-item list
        items = row[3].split('||')

        # append each item from the last column together with other columns
        # as an individual row to the data list, N items will produce N rows
        for item in items:
            data.append([row[0], row[1], row[2], item])

df = pd.DataFrame(data)

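This is not the final solution: you still need to skip the first (header) row, clean up the individual items in the last column, and so on, but that should be trivial.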
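Those remaining steps could look roughly like the sketch below. It assumes the file has exactly the four columns shown in the question and that every item has the form `N::Label`. The function name `explode_participants` and the inline sample data are made up for illustration; the sample stands in for the real CSV file.

```python
import csv
import io

import pandas as pd


def explode_participants(file_obj):
    """Return a DataFrame with one row per participant.

    Assumes four columns, the last one holding '||'-separated items.
    """
    reader = csv.reader(file_obj)
    header = next(reader)  # consume the header row and reuse it as column names
    rows = []
    for incident_id, date, state, participants in reader:
        for item in participants.split('||'):
            # '0::Victim' -> 'Victim'; keep the item unchanged if '::' is missing
            _, sep, label = item.partition('::')
            rows.append([incident_id, date, state, label if sep else item])
    return pd.DataFrame(rows, columns=header)


# Hypothetical sample mirroring the layout shown in the question
sample = io.StringIO(
    "incident_id,date,state,participant_type\n"
    "461105,1/1/2013,Pennsylvania,0::Victim||1::Victim||2::Subject-Suspect\n"
    "478855,1/1/2013,Ohio,0::Subject-Suspect\n"
)
df = explode_participants(sample)
print(df)
```

With the real data, pass the opened file (for example `open('Gun violence Shortened version.csv', newline='')`) instead of the `io.StringIO` sample. Note that all values come back as strings, so convert `incident_id` or `date` afterwards if needed.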

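There are some benchmarks showing that row-wise operations in pandas are about 1000 times slower than preparing the data with Python's data structures and creating the DataFrame from it.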

Answer 2

Here is another script that produces the desired output, although the solution shared by Dawid looks faster:

{{1}}
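As a pandas-only alternative (a sketch of one possible approach, not necessarily the script this answer refers to), the column can be split on `||` and expanded with `DataFrame.explode`, which requires pandas 0.25 or later. The sample DataFrame is made up to mirror the question's layout; with the real data it would come from `pd.read_csv`.

```python
import pandas as pd

# Hypothetical sample mirroring the layout shown in the question
data = pd.DataFrame({
    'incident_id': [461105, 478855],
    'date': ['1/1/2013', '1/1/2013'],
    'state': ['Pennsylvania', 'Ohio'],
    'participant_type': [
        '0::Victim||1::Victim||2::Subject-Suspect',
        '0::Subject-Suspect',
    ],
})

result = (
    # turn each cell into a list of items; the pipes are escaped because
    # multi-character patterns are treated as regular expressions
    data.assign(participant_type=data['participant_type'].str.split(r'\|\|'))
        # one row per list element, other columns are repeated
        .explode('participant_type')
        .reset_index(drop=True)
)

# strip the leading "N::" index from every item
result['participant_type'] = result['participant_type'].str.replace(
    r'^\d+::', '', regex=True
)
print(result)
```

This keeps everything inside pandas and preserves the column dtypes, at the cost of the speed concern raised in the first answer.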

How to split data in a single column into new rows (with the other columns kept unchanged in the new rows)
