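I have a DataFrame in which every row is duplicated, and when I plot it, the chart contains the duplicated pairs. I tried to remove them, but then the chart is plotted incorrectly. My csv is here.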
DataFrame common_users:
    used_at  common users  pair of websites
 0     2014          1364  avito.ru and e1.ru
 1     2014          1364  e1.ru and avito.ru
 2     2014          1716  avito.ru and drom.ru
 3     2014          1716  drom.ru and avito.ru
 4     2014          1602  avito.ru and auto.ru
 5     2014          1602  auto.ru and avito.ru
 6     2014           299  avito.ru and avtomarket.ru
 7     2014           299  avtomarket.ru and avito.ru
 8     2014           579  avito.ru and am.ru
 9     2014           579  am.ru and avito.ru
10     2014           602  avito.ru and irr.ru/cars
11     2014           602  irr.ru/cars and avito.ru
12     2014           424  avito.ru and cars.mail.ru/sale
13     2014           424  cars.mail.ru/sale and avito.ru
14     2014           634  e1.ru and drom.ru
15     2014           634  drom.ru and e1.ru
16     2014           475  e1.ru and auto.ru
17     2014           475  auto.ru and e1.ru
.....
You can see that the website names are reversed. I tried to sort it by pair of websites, but I get a KeyError. I use this code:
df = pd.read_csv("avito_trend.csv", parse_dates=[2])

def f(df):
    dfs = []
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#sort by column 2014
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)
ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]
rects = ax.patches
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index]
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_width() + 3, rect.get_y() + rect.get_height(), label, fontsize=8)
plt.show()
My chart looks like this:
Answer 0 (score: 1)
You can first add a new column sort inside the function f, then sort the values by the column pair of websites, and finally call drop_duplicates on the columns used_at and sort:
import pandas as pd
import itertools

df = pd.read_csv("avito_trend.csv",
                 parse_dates=[2])

def f(df):
    dfs = []
    i = 0
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:
        i += 1
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x), 'sort': i}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x), 'sort': i}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
common_users = common_users.sort_values('pair of websites')
common_users = common_users.drop_duplicates(subset=['used_at','sort'])
#print common_users
graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#print graph_by_common_users
#change order of columns
graph_by_common_users = graph_by_common_users[[2015,2014]]
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)
ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]
rects = ax.patches
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index]
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_width() + 20, rect.get_y() - 0.25 + rect.get_height(), label, fontsize=8)
#sorting values of legend
handles, labels = ax.get_legend_handles_labels()
# sort both labels and handles by labels
labels, handles = zip(*sorted(zip(labels, handles), key=lambda t: t[0]))
ax.legend(handles, labels)
My chart:
Edit:
Regarding the comment: because the combinations for the years 2014 and 2015 differ, 4 values are missing in the first column and 4 in the second:
used_at                               2015    2014
pair of websites
avito.ru and drom.ru                1491.0  1716.0
avito.ru and auto.ru                1473.0  1602.0
avito.ru and e1.ru                  1153.0  1364.0
drom.ru and auto.ru                    NaN   874.0
e1.ru and drom.ru                    539.0   634.0
avito.ru and irr.ru/cars             403.0   602.0
avito.ru and am.ru                   262.0   579.0
e1.ru and auto.ru                    451.0   475.0
avito.ru and cars.mail.ru/sale       256.0   424.0
drom.ru and irr.ru/cars              277.0   423.0
auto.ru and irr.ru/cars              288.0   409.0
auto.ru and am.ru                    224.0   408.0
drom.ru and am.ru                    187.0   394.0
auto.ru and cars.mail.ru/sale        195.0   330.0
avito.ru and avtomarket.ru           205.0   299.0
drom.ru and cars.mail.ru/sale        189.0   292.0
drom.ru and avtomarket.ru            175.0   247.0
auto.ru and avtomarket.ru            162.0   243.0
e1.ru and irr.ru/cars                148.0   235.0
e1.ru and am.ru                       99.0   224.0
am.ru and irr.ru/cars                  NaN   223.0
irr.ru/cars and cars.mail.ru/sale     94.0   197.0
am.ru and cars.mail.ru/sale            NaN   166.0
e1.ru and cars.mail.ru/sale          105.0   154.0
e1.ru and avtomarket.ru              105.0   139.0
avtomarket.ru and irr.ru/cars          NaN   139.0
avtomarket.ru and am.ru               72.0   133.0
avtomarket.ru and cars.mail.ru/sale   48.0   105.0
auto.ru and drom.ru                  799.0     NaN
cars.mail.ru/sale and am.ru           73.0     NaN
irr.ru/cars and am.ru                102.0     NaN
irr.ru/cars and avtomarket.ru         73.0     NaN
Creating all the inverted combinations solves the problem. But why are there NaN values? Why do the combinations differ between 2014 and 2015?
I added a print statement to the function f:
def f(df):
    print(df['address'].unique())
    dfs = []
    i = 0
    for x in [list(x) for x in itertools.combinations((df['address'].unique()), 2)]:
        ...
        ...
and the output is (why the first list is printed twice is described in the warning here):
['avito.ru' 'e1.ru' 'drom.ru' 'auto.ru' 'avtomarket.ru' 'am.ru'
 'irr.ru/cars' 'cars.mail.ru/sale']
['avito.ru' 'e1.ru' 'drom.ru' 'auto.ru' 'avtomarket.ru' 'am.ru'
 'irr.ru/cars' 'cars.mail.ru/sale']
['avito.ru' 'e1.ru' 'auto.ru' 'drom.ru' 'irr.ru/cars' 'avtomarket.ru'
 'cars.mail.ru/sale' 'am.ru']
So the lists of addresses are ordered differently in the two years, the combinations built from them differ as well, and the pivot ends up with some NaN values.
The solution is to sort the list of addresses before building the combinations.
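A minimal sketch of the effect, using made-up toy data rather than the avito_trend.csv file: unique() returns values in order of first appearance, so the same two sites can yield two different labels in different years, and pivot then leaves NaN wherever a label is missing for a year. The helper name label_table and the constant count of 10 are hypothetical, chosen only for illustration.

```python
import itertools
import pandas as pd

# Hypothetical toy data: the same two sites, first seen in a different order each year
toy = pd.DataFrame({
    'used_at': [2014, 2014, 2015, 2015],
    'address': ['drom.ru', 'auto.ru', 'auto.ru', 'drom.ru'],
})

def label_table(df, sort_first):
    """Build pair labels per year and pivot them (hypothetical helper)."""
    rows = []
    for year, group in df.groupby('used_at'):
        # unique() keeps the order of first appearance within the group
        sites = group['address'].unique()
        if sort_first:
            sites = sorted(sites)
        for pair in itertools.combinations(sites, 2):
            rows.append({'used_at': int(year),
                         'pair of websites': ' and '.join(pair),
                         'common users': 10})  # placeholder count, not real data
    return pd.DataFrame(rows).pivot(index='pair of websites',
                                    columns='used_at',
                                    values='common users')

print(label_table(toy, sort_first=False))  # two differently ordered labels, one NaN per year
print(label_table(toy, sort_first=True))   # a single label shared by both years
```

Without sorting, the toy frame reproduces the 'drom.ru and auto.ru' versus 'auto.ru and drom.ru' split seen in the table; with sorting, both years share one label and no NaN appears.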
In the function f this looks like:
def f(df):
    #print (sorted(df['address'].unique()))
    dfs = []
    for x in [list(x) for x in itertools.combinations(sorted(df['address'].unique()), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        ...
        ...
The full code is:
import pandas as pd
import itertools

df = pd.read_csv("avito_trend.csv",
                 parse_dates=[2])

def f(df):
    #print (sorted(df['address'].unique()))
    dfs = []
    for x in [list(x) for x in itertools.combinations(sorted(df['address'].unique()), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
#print common_users
graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#change order of columns
graph_by_common_users = graph_by_common_users[[2015,2014]]
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)
#print graph_by_common_users
Graph:
Answer 1 (score: 0)
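Your DataFrame does not look the way you want it to. The DataFrame contains 2014 and 2015 as column header names, not as row values under a used_at index. Also, used_at is the name of the index, not the index label of the first row.
You can test whether this is true by doing the following: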
import pandas as pd
from io import StringIO

text_data = '''
used_at 2014 2015
address
am.ru 621 273
auto.ru 1752 1595
avito.ru 5460 4631
avtomarket.ru 314 215
cars.mail.ru/sale 457 271
drom.ru 1934 1623
e1.ru 1654 1359
irr.ru/cars 619 426
'''

# Read in tabular data with used_at row as header
df = pd.read_table(StringIO(text_data), sep=r'\s+', index_col=0)
print('DataFrame created with used_at row as header:')
print(df)
print()
# print(df.used_at) would raise AttributeError: 'DataFrame' object has no attribute 'used_at'
print('df columns :', df.columns)
print('df index name :', df.index.name)
print()
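If the years are needed as row values instead of column headers, one possible approach (not part of the original answer) is to reshape the wide frame into long form with melt. The sketch below is an assumption-laden illustration: it rebuilds three rows of the table by hand, and the column name value is hypothetical.

```python
import pandas as pd

# Hypothetical wide frame: three rows copied from the table, years as column headers
wide = pd.DataFrame(
    {2014: [621, 1752, 5460], 2015: [273, 1595, 4631]},
    index=pd.Index(['am.ru', 'auto.ru', 'avito.ru'], name='address'),
)

# reset_index() turns the address index into an ordinary column;
# melt() then moves the year headers into a used_at column
long = wide.reset_index().melt(id_vars='address',
                               var_name='used_at',
                               value_name='value')  # 'value' is a made-up column name
print(long)
```

In the long frame each row holds one address, one year and one number, so long['used_at'] can be accessed as a regular column.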