How do I collapse rows based on the value in a column?

Time: 2016-07-25 20:36:44

Tags: python bash tr

Let me describe what I mean in more detail. Suppose I have a data table like this:

+-----------+---------+---------+---------+---------+---------+---------+--------------+
|           | Person1 | Person2 | Person4 | Person4 | Person5 | Person6 |     City     |
+-----------+---------+---------+---------+---------+---------+---------+--------------+
| January   | -       |       - | Yes     |       - | Yes     | -       | SanFrancisco |
| Febuary   | Yes     |       - | -       |       - | -       | -       | SanFrancisco |
| March     | -       |       - | -       |       - | -       | -       | SanFrancisco |
| April     | -       |       - | -       |       - | -       | -       | NewYork      |
| May       | Yes     |       - | -       |       - | -       | -       | NewYork      |
| June      | -       |       - | -       |       - | -       | -       | NewYork      |
| July      | -       |       - | -       |       - | Yes     | -       | NewYork      |
| August    | -       |       - | -       |       - | -       | -       | NewYork      |
| September | -       |       - | -       |       - | -       | -       | Miami        |
| November  | -       |       - | -       |       - | -       | Yes     | Miami        |
| December  | -       |       - | -       |       - | -       | -       | Miami        |
+-----------+---------+---------+---------+---------+---------+---------+--------------+

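Ignore the ASCII formatting, which is only there for Stack Overflow; this is a simple spreadsheet tracking six people by which city they went to.

What I want to know is which people visited which cities, effectively condensing the list into something like this: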

+---------+---------+---------+---------+---------+---------+--------------+
| Person1 | Person2 | Person4 | Person4 | Person5 | Person6 | City         |
+---------+---------+---------+---------+---------+---------+--------------+
| Yes     | -       | Yes     | -       | Yes     | -       | SanFrancisco |
| Yes     | -       | -       | -       | Yes     | -       | NewYork      |
| -       | -       | -       | -       | -       | Yes     | Miami        |
+---------+---------+---------+---------+---------+---------+--------------+

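Each city gets a single row showing which people visited it. Is there a best way to do this, or some tr (squeeze)/sed-style tool that already does it? If I have to code it myself, what is the best logic?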

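2 answers:

Answer 0 (score: 2)

The correct term for what you're trying to do here is aggregation. In my experience, the word collapse isn't commonly used for this operation.

I'm learning Python on the fly here, so there may be a better way to do this, but I got it working with the pandas module, specifically its DataFrame type: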

import pandas
import re

# Sample data: one row per month, one column per person, plus the city.
df = pandas.DataFrame({
    'Date':['January','Febuary','March','April','May','June','July','August','September','November','December'],
    'Person1':['-','Yes','-','-','Yes','-','-','-','-','-','-'],
    'Person2':['-','-','-','-','-','-','-','-','-','-','-'],
    'Person3':['Yes','-','-','-','-','-','-','-','-','-','-'],
    'Person4':['-','-','-','-','-','-','-','-','-','-','-'],
    'Person5':['Yes','-','-','-','-','-','Yes','-','-','-','-'],
    'Person6':['-','-','-','-','-','-','-','-','-','Yes','-'],
    'City':['SanFrancisco','SanFrancisco','SanFrancisco','NewYork','NewYork','NewYork','NewYork','NewYork','Miami','Miami','Miami']
})

# Group by city; a person gets 'Yes' if any row in that city's group has 'Yes'.
# Column order in the output follows dict ordering, so it may differ from what is shown.
df.groupby('City').agg({k:lambda x: 'Yes' if 'Yes' in x.values else '-' for k in filter(lambda x:re.search(r'^Person',x),df.keys())})
##              Person2 Person3 Person1 Person6 Person4 Person5
## City
## Miami              -       -       -     Yes       -       -
## NewYork            -       -     Yes       -       -     Yes
## SanFrancisco       -     Yes     Yes       -       -     Yes

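Also, I strongly recommend you take a look at R, an excellent and increasingly ubiquitous platform for statistics, graphics, and general-purpose data analysis that is well suited to working with Excel-style tabular data. These kinds of data-format transformations definitely feel more natural in R, although the learning curve is fairly steep. Here is an R implementation: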

df <- read.csv(stringsAsFactors=FALSE, text=
'Date,Person1,Person2,Person3,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami'
)

# Drop the Date column (df[-1L]), then mark each person 'Yes' per city if any row says 'Yes'.
aggregate(. ~ City, df[-1L], function(x) if (any(x == 'Yes')) 'Yes' else '-')
##           City Person1 Person2 Person3 Person4 Person5 Person6
## 1        Miami       -       -       -       -       -     Yes
## 2      NewYork     Yes       -       -       -     Yes       -
## 3 SanFrancisco     Yes       -     Yes       -     Yes       -
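
If neither pandas nor R is an option, the same aggregation can be sketched with only the Python standard library. This is a minimal sketch, not a definitive implementation: the helper name `collapse_by_city` and its `key`/`skip` parameters are made up for illustration, and it assumes the data arrives as CSV with the same columns as the examples above. Because it keeps one dictionary entry per city, it does not need the rows for a city to be next to each other.

```python
import csv
import io

# Sample data copied from the examples above.
CSV_TEXT = """\
Date,Person1,Person2,Person3,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami
"""


def collapse_by_city(rows, key="City", skip=("Date",)):
    """Hypothetical helper: return {city: {person: 'Yes' or '-'}}.

    A person is 'Yes' for a city if any row for that city says 'Yes'.
    """
    result = {}  # plain dicts keep insertion order on Python 3.7+
    for row in rows:
        merged = result.setdefault(row[key], {})
        for col, value in row.items():
            if col == key or col in skip:
                continue
            # Once a 'Yes' has been seen for this city/person, it sticks.
            seen_yes = value == "Yes" or merged.get(col) == "Yes"
            merged[col] = "Yes" if seen_yes else "-"
    return result


collapsed = collapse_by_city(csv.DictReader(io.StringIO(CSV_TEXT)))
for city, people in collapsed.items():
    print(city, ",".join(people.values()))
```

The cities come out in the order they first appear in the input rather than sorted alphabetically, which differs from the pandas and R output above.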

Answer 1 (score: 1)

$ cat tst.awk
function prt() {
    if ( prev != "" ) {
        for (i=2;i<=NF;i++) {
            printf "%s%s", vals[i], (i<NF ? OFS : ORS)
        }
    }
    delete vals
}

BEGIN { FS=OFS="," }
$NF != prev { prt() }
{
    for (i=1;i<=NF;i++) {
        vals[i] = (vals[i] ~ /[[:alpha:]]/ ? vals[i] : $i)
    }
    prev = $NF
}
END { prt() }

$ awk -f tst.awk file
Person1,Person2,Person4,Person4,Person5,Person6,City
Yes,-,Yes,-,Yes,-,SanFrancisco
Yes,-,-,-,Yes,-,NewYork
-,-,-,-,-,Yes,Miami

The above assumes your input is actually CSV like this:

$ cat file
Month,Person1,Person2,Person4,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami

and that you want CSV output.
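
For readers more comfortable with Python than awk, the streaming logic of the awk script can be approximated as follows. This is a rough sketch under a few assumptions: the function name `collapse_runs` is made up for illustration, the input is the CSV shown above, and a cell counts as visited only when it is exactly `Yes` (the awk script instead keeps the first value containing a letter). Like the awk script, it merges only consecutive rows that share the same last field, so rows for a city must be contiguous.

```python
import csv
import io
from itertools import groupby

# Same CSV as `file` above, with the header names kept exactly as given.
CSV_TEXT = """\
Month,Person1,Person2,Person4,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami
"""


def collapse_runs(lines):
    """Hypothetical helper: merge each run of consecutive rows sharing a city."""
    reader = csv.reader(lines)
    header = next(reader)
    out = [header[1:]]  # drop the month column, as the awk script does
    for city, run in groupby(reader, key=lambda rec: rec[-1]):
        # Transpose the person cells so each tuple holds one person's values.
        person_columns = zip(*(rec[1:-1] for rec in run))
        merged = ["Yes" if "Yes" in col else "-" for col in person_columns]
        out.append(merged + [city])
    return out


result = collapse_runs(io.StringIO(CSV_TEXT))
for rec in result:
    print(",".join(rec))
```

If the rows were not already grouped by city, sorting them on the last field first (or using a dictionary keyed by city) would be needed before this run-based merge gives one row per city.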