I'll describe what I mean in more detail here. Suppose I have a data table like the following:
+-----------+---------+---------+---------+---------+---------+---------+--------------+
|           | Person1 | Person2 | Person3 | Person4 | Person5 | Person6 | City         |
+-----------+---------+---------+---------+---------+---------+---------+--------------+
| January   | -       | -       | Yes     | -       | Yes     | -       | SanFrancisco |
| Febuary   | Yes     | -       | -       | -       | -       | -       | SanFrancisco |
| March     | -       | -       | -       | -       | -       | -       | SanFrancisco |
| April     | -       | -       | -       | -       | -       | -       | NewYork      |
| May       | Yes     | -       | -       | -       | -       | -       | NewYork      |
| June      | -       | -       | -       | -       | -       | -       | NewYork      |
| July      | -       | -       | -       | -       | Yes     | -       | NewYork      |
| August    | -       | -       | -       | -       | -       | -       | NewYork      |
| September | -       | -       | -       | -       | -       | -       | Miami        |
| November  | -       | -       | -       | -       | -       | Yes     | Miami        |
| December  | -       | -       | -       | -       | -       | -       | Miami        |
+-----------+---------+---------+---------+---------+---------+---------+--------------+
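Ignore the ASCII formatting, which is only there for Stack Overflow. This is a simple spreadsheet that tracks six people by the city they traveled to.
What I want to know is which people visited which cities, effectively collapsing the list into something like this: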
+---------+---------+---------+---------+---------+---------+--------------+
| Person1 | Person2 | Person3 | Person4 | Person5 | Person6 | City         |
+---------+---------+---------+---------+---------+---------+--------------+
| Yes     | -       | Yes     | -       | Yes     | -       | SanFrancisco |
| Yes     | -       | -       | -       | Yes     | -       | NewYork      |
| -       | -       | -       | -       | -       | Yes     | Miami        |
+---------+---------+---------+---------+---------+---------+--------------+
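There is one row per city, showing which people visited it. Is there a best way to do this, or is there some tr (squeeze) / sed style tool that already does it? If I have to code this myself, what is the best logic?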
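Answer 0 (score: 2)
The correct term for what you are trying to do here is aggregation. In my experience, the word collapse is not commonly used for this operation.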
I'm learning Python on the fly here, so there may be a better way, but I got this working using the pandas module, specifically its DataFrame type:
import pandas

df = pandas.DataFrame({
    'Date': ['January','Febuary','March','April','May','June','July','August','September','November','December'],
    'Person1': ['-','Yes','-','-','Yes','-','-','-','-','-','-'],
    'Person2': ['-','-','-','-','-','-','-','-','-','-','-'],
    'Person3': ['Yes','-','-','-','-','-','-','-','-','-','-'],
    'Person4': ['-','-','-','-','-','-','-','-','-','-','-'],
    'Person5': ['Yes','-','-','-','-','-','Yes','-','-','-','-'],
    'Person6': ['-','-','-','-','-','-','-','-','-','Yes','-'],
    'City': ['SanFrancisco','SanFrancisco','SanFrancisco','NewYork','NewYork','NewYork','NewYork','NewYork','Miami','Miami','Miami']
})

# Aggregate each Person* column per city: 'Yes' if any month in that city says 'Yes'.
person_cols = [c for c in df.columns if c.startswith('Person')]
df.groupby('City')[person_cols].agg(lambda x: 'Yes' if 'Yes' in x.values else '-')
##              Person1 Person2 Person3 Person4 Person5 Person6
## City
## Miami              -       -       -       -       -     Yes
## NewYork          Yes       -       -       -     Yes       -
## SanFrancisco     Yes       -     Yes       -     Yes       -
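The same aggregation can also be phrased in terms of booleans, which some people find easier to read: convert the Yes/- markers to True/False, ask whether any value in each city group is True, then convert back. This is only a sketch of an alternative, not a claim that it is the better way. The `visits` list and the `people`, `visited`, and `summary` names are made up for this illustration, and `sort=False` is used so the cities come out in their original order.

```python
import pandas as pd

people = [f"Person{i}" for i in range(1, 7)]

# (month, who traveled that month, city): the same sample data as above,
# written in a compact, illustrative form.
visits = [
    ("January", ["Person3", "Person5"], "SanFrancisco"),
    ("Febuary", ["Person1"], "SanFrancisco"),
    ("March", [], "SanFrancisco"),
    ("April", [], "NewYork"),
    ("May", ["Person1"], "NewYork"),
    ("June", [], "NewYork"),
    ("July", ["Person5"], "NewYork"),
    ("August", [], "NewYork"),
    ("September", [], "Miami"),
    ("November", ["Person6"], "Miami"),
    ("December", [], "Miami"),
]
df = pd.DataFrame(
    [{"Date": month,
      **{p: ("Yes" if p in who else "-") for p in people},
      "City": city}
     for month, who, city in visits]
)

# Turn the markers into booleans, then ask "did anyone say Yes?" per city.
visited = df[people].eq("Yes").groupby(df["City"], sort=False).any()

# Convert back to the original Yes/- notation for display.
summary = visited.apply(lambda col: col.map({True: "Yes", False: "-"}))
print(summary)
```

Keeping the intermediate `visited` frame as booleans is handy if you want to do further work on it, such as counting how many cities each person visited with `visited.sum()`.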
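Also, I strongly recommend you take a look at R, an excellent and increasingly popular platform for statistics, graphics, and general data analysis that is well suited to working with Excel-style tabular data. These kinds of data reshaping operations are definitely more natural in R, although the learning curve is fairly steep. Here is an implementation in R: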
df <- read.csv(stringsAsFactors=FALSE, text=
'Date,Person1,Person2,Person3,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami'
)
aggregate(. ~ City, df[-1L], function(x) if (any(x == 'Yes')) 'Yes' else '-')
##           City Person1 Person2 Person3 Person4 Person5 Person6
## 1        Miami       -       -       -       -       -     Yes
## 2      NewYork     Yes       -       -       -     Yes       -
## 3 SanFrancisco     Yes       -     Yes       -     Yes       -
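If neither pandas nor R is at hand, the logic itself is small enough to write with the Python standard library. The following is a minimal sketch, assuming the data is available as CSV with a `City` column and columns whose names start with `Person`. The function name `aggregate_by_city` and the inlined `DATA` string are illustrative only. It keeps one dictionary entry per city and flips a person's marker to 'Yes' the first time a 'Yes' is seen, so the rows do not need to be sorted or grouped by city.

```python
import csv
import io

# Same sample data as in the R example, inlined so the sketch is self-contained.
DATA = """\
Date,Person1,Person2,Person3,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami
"""


def aggregate_by_city(lines):
    """Return {city: {person: 'Yes' or '-'}} for an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    people = [name for name in reader.fieldnames if name.startswith("Person")]
    result = {}
    for row in reader:
        # First time we see a city, start everyone at '-'.
        flags = result.setdefault(row["City"], dict.fromkeys(people, "-"))
        for person in people:
            if row[person] == "Yes":
                flags[person] = "Yes"
    return result


summary = aggregate_by_city(io.StringIO(DATA))
for city, flags in summary.items():
    print(city, *flags.values())
```

Because dictionaries preserve insertion order, the cities print in the order they first appear in the input rather than alphabetically.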
Answer 1 (score: 1)
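$ cat tst.awk
# Print the merged row for the group that just ended, skipping field 1 (the month).
function prt() {
    if ( prev != "" ) {
        for (i=2;i<=NF;i++) {
            printf "%s%s", vals[i], (i<NF ? OFS : ORS)
        }
    }
    delete vals
}
BEGIN { FS=OFS="," }
# The last field (City) changed, so flush the previous group.
# The header line forms a group of its own and is printed unchanged.
$NF != prev { prt() }
{
    # Keep the first alphabetic value seen in each column, so "Yes" wins over "-".
    for (i=1;i<=NF;i++) {
        vals[i] = (vals[i] ~ /[[:alpha:]]/ ? vals[i] : $i)
    }
    prev = $NF
}
END { prt() }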
$ awk -f tst.awk file
Person1,Person2,Person4,Person4,Person5,Person6,City
Yes,-,Yes,-,Yes,-,SanFrancisco
Yes,-,-,-,Yes,-,NewYork
-,-,-,-,-,Yes,Miami
The above assumes your input is actually a CSV like this:
$ cat file
Month,Person1,Person2,Person4,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami
and that you want CSV output.
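For comparison, here is a rough Python counterpart of the streaming approach, assuming the same CSV input. Like the awk script, it depends on rows for the same city being adjacent, because `itertools.groupby` only merges consecutive rows with equal keys. It is a sketch rather than an exact port: it tests for the literal string "Yes" instead of "first alphabetic value", and the function name `squeeze` and the inlined `DATA` are made up for the example. Fields are handled by position, so the repeated Person4 header in the file is not a problem.

```python
import csv
import io
from itertools import groupby

# Same content as `file` above, inlined so the sketch is self-contained.
DATA = """\
Month,Person1,Person2,Person4,Person4,Person5,Person6,City
January,-,-,Yes,-,Yes,-,SanFrancisco
Febuary,Yes,-,-,-,-,-,SanFrancisco
March,-,-,-,-,-,-,SanFrancisco
April,-,-,-,-,-,-,NewYork
May,Yes,-,-,-,-,-,NewYork
June,-,-,-,-,-,-,NewYork
July,-,-,-,-,Yes,-,NewYork
August,-,-,-,-,-,-,NewYork
September,-,-,-,-,-,-,Miami
November,-,-,-,-,-,Yes,Miami
December,-,-,-,-,-,-,Miami
"""


def squeeze(lines):
    """Yield one merged row per run of consecutive rows sharing the last field."""
    reader = csv.reader(lines)
    header = next(reader)
    yield header[1:]  # drop the Month column, as the awk script does
    for city, rows in groupby(reader, key=lambda row: row[-1]):
        # zip(*rows) walks the run column by column; skip month and city.
        columns = list(zip(*rows))[1:-1]
        yield ["Yes" if "Yes" in col else "-" for col in columns] + [city]


for row in squeeze(io.StringIO(DATA)):
    print(",".join(row))
```

If the input is not already grouped by city, sort it by the last field first, or use a dictionary keyed by city instead of `groupby`.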