I'm analyzing a survey dataset that looks like this:
Respondent Country HaveWorkedLanguage
0 1 United States Swift
1 2 United Kingdom JavaScript; Python; Ruby; SQL
2 3 United Kingdom Java; PHP; Python
3 4 United States Matlab; Python; R; SQL
4 5 Switzerland NaN
5 6 New Zealand JavaScript; PHP; Rust
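As you can see, each cell of the HaveWorkedLanguage column holds either a single value or several values separated by '; '. I want to find the most popular language in each country. As a first step, I grouped the data like this: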
stu=students.groupby(['Country','HaveWorkedLanguage'])['Respondent'].count().reset_index()
stu.columns=['Country','Known_Languages','Count']
This gave me a DataFrame like this:
Country Known_Languages Count
0 Afghanistan Assembly; C; C++; Hack; Java; JavaScript 1
1 Afghanistan C 1
2 Albania C#; Java; Python; SQL 1
3 Albania C++; C#; Java; JavaScript; PHP 1
4 Albania C++; C#; JavaScript; SQL 1
5 Albania C++; Java; JavaScript; PHP; SQL 2
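What I actually want is a DataFrame with one row per country and language, together with the number of respondents, so that the highest count marks the most popular language. It should look like this: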
Country Known_Languages Count
0 United States Java 100
1 United States Python 80
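Earlier, I was able to find the most popular languages overall with the following code:
for i in ['C','C++','C#','Java','Python','R','JavaScript']:
    # count the rows whose split language list contains i
    print(i,':',survey['HaveWorkedLanguage'].apply(lambda x: i in str(x).split('; ')).value_counts()[True])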
The output is:
C : 6974
C++ : 8155
C# : 12476
Java : 14524
Python : 11704
R : 1634
JavaScript : 22875
But now I want to break this down by country. How can I do that?
Answer 0 (score: 1)
hwl = students.HaveWorkedLanguage
cty = students.Country
stu = hwl.str.get_dummies('; ').groupby(cty).sum()
pd.concat(
[stu.idxmax(1), stu.max(1)],
axis=1, keys=['Lang', 'Count']
)
Lang Count
Country
New Zealand JavaScript 1
Switzerland Java 0
United Kingdom Python 2
United States Matlab 1
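The table above keeps only the top language per country. As a minimal sketch (not part of the original answer), the same `get_dummies` result can be reshaped into the long Country / Known_Languages / Count layout the question asks for. The sample rows are copied from the question; the names `per_country` and `long_counts` are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
students = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5, 6],
    "Country": ["United States", "United Kingdom", "United Kingdom",
                "United States", "Switzerland", "New Zealand"],
    "HaveWorkedLanguage": ["Swift", "JavaScript; Python; Ruby; SQL",
                           "Java; PHP; Python", "Matlab; Python; R; SQL",
                           None, "JavaScript; PHP; Rust"],
})

# One indicator column per language, summed within each country.
per_country = (students["HaveWorkedLanguage"].str.get_dummies("; ")
               .groupby(students["Country"]).sum())

# Reshape wide -> long: one row per (country, language) pair.
long_counts = (per_country.stack()
               .rename_axis(["Country", "Known_Languages"])
               .reset_index(name="Count"))

# Drop pairs that never occur and list the largest counts first per country.
long_counts = (long_counts[long_counts["Count"] > 0]
               .sort_values(["Country", "Count"], ascending=[True, False])
               .reset_index(drop=True))
print(long_counts)
```

Countries whose respondents listed no language (Switzerland in the sample) disappear from this view because all of their counts are zero.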
PROJECT / kill: numpy technique
mask = students.HaveWorkedLanguage.notnull().values
fc, uc = pd.factorize(students.Country.values.astype(str))
hwl = students.HaveWorkedLanguage.values.astype(str)
lol = np.char.split(hwl, '; ')  # np.char is the public alias of np.core.defchararray
lol[np.flatnonzero(~mask)] = [[]]
i = fc.repeat([len(l) for l in lol])
j, ul = pd.factorize(np.concatenate(lol))
n = uc.size
m = ul.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
x = counts.argmax(1)
pd.DataFrame(
np.column_stack([ul[x], counts[np.arange(n), x]]),
uc, ['Lang', 'Count'])
Lang Count
United States Swift 1
United Kingdom Python 2
Switzerland Swift 0
New Zealand JavaScript 1
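The core of the numpy version is the `np.bincount(i * m + j, ...)` line, which builds a two-dimensional count table from two arrays of integer codes. Here is a small standalone sketch of that trick on hypothetical toy codes (the arrays below are invented for illustration, not taken from the survey):

```python
import numpy as np

# Hypothetical toy data: row codes i (think countries) and
# column codes j (think languages), one pair per observation.
i = np.array([0, 0, 1, 1, 1, 2])
j = np.array([0, 1, 1, 1, 2, 0])
n, m = 3, 3  # number of distinct row codes and column codes

# Each (i, j) pair maps to the unique flat index i * m + j, so a single
# bincount over the flat indices counts every pair at once; reshaping
# turns the flat counts back into an n x m table.
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
print(counts)

# Column with the highest count in each row, and that count.
best = counts.argmax(axis=1)
best_count = counts[np.arange(n), best]
print(best, best_count)
```

Because `argmax` returns the first maximum, ties and all-zero rows resolve to the lowest column code, which is why Switzerland shows up with a language and a count of 0 in the output above.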
Timing
%%timeit
hwl = students.HaveWorkedLanguage
cty = students.Country
stu = hwl.str.get_dummies('; ').groupby(cty).sum()
pd.concat(
[stu.idxmax(1), stu.max(1)],
axis=1, keys=['Lang', 'Count']
)
100 loops, best of 3: 3.22 ms per loop
%%timeit
mask = students.HaveWorkedLanguage.notnull().values
fc, uc = pd.factorize(students.Country.values.astype(str))
hwl = students.HaveWorkedLanguage.values.astype(str)
lol = np.char.split(hwl, '; ')  # np.char is the public alias of np.core.defchararray
lol[np.flatnonzero(~mask)] = [[]]
i = fc.repeat([len(l) for l in lol])
j, ul = pd.factorize(np.concatenate(lol))
n = uc.size
m = ul.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
x = counts.argmax(1)
pd.DataFrame(np.column_stack([ul[x], counts[np.arange(n), x]]), uc, ['Lang', 'Count'])
1000 loops, best of 3: 570 µs per loop
Answer 1 (score: 1)
I did this in quite a few steps, so maybe someone has a more pythonic solution:
df = pd.DataFrame({"Country":["UK", "UK", "UK", "USA", "USA", "USA"], "Languages":["Python" , "Python, PHP, Java", "Java", "Python", "Java", "Python, Javascript"]})
df
Country Languages
0 UK Python
1 UK Python, PHP, Java
2 UK Java
3 USA Python
4 USA Java
5 USA Python, Javascript
df2 = df.Languages.apply(lambda row: pd.Series(row.split(","))).copy() # split the column; splitting on "," keeps the leading space (" Java"), hence the second Java column below
df3 = pd.get_dummies(df2, prefix_sep="", prefix="") # get dummies
df3
Java Python Javascript PHP Java
0 0 1 0 0 0
1 0 1 0 1 1
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 1 1 0 0
df4 = pd.merge(df[["Country"]], df3, left_index=True, right_index=True)
df4
Country Java Python Javascript PHP Java
0 UK 0 1 0 0 0
1 UK 0 1 0 1 1
2 UK 1 0 0 0 0
3 USA 0 1 0 0 0
4 USA 1 0 0 0 0
5 USA 0 1 1 0 0
df5 = df4.groupby("Country").sum().reset_index().copy() # sum it
df5
Country Java Python Javascript PHP Java
0 UK 1 2 0 1 1
1 USA 1 2 1 0 0
df6 = pd.melt(df5, id_vars=["Country"], var_name="Language", value_name="Value") # columns to rows
df6
Country Language Value
0 UK Java 1
1 USA Java 1
2 UK Python 2
3 USA Python 2
4 UK Javascript 0
5 USA Javascript 1
6 UK PHP 1
7 USA PHP 0
8 UK Java 1
9 USA Java 0
df7 = df6.sort_values(by=["Country", "Value"], ascending=False) # sort
df7
Country Language Value
3 USA Python 2
1 USA Java 1
5 USA Javascript 1
7 USA PHP 0
9 USA Java 0
2 UK Python 2
0 UK Java 1
6 UK PHP 1
8 UK Java 1
4 UK Javascript 0
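Note that the tables above carry two Java columns because splitting on "," leaves a leading space in " Java", so the UK count for Java is reported as 1 and 1 instead of 2. A shorter sketch of the same idea, assuming a pandas version that has `DataFrame.explode` (0.25 or later), splits on ", " and counts directly; the names `exploded` and `counts` are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["UK", "UK", "UK", "USA", "USA", "USA"],
    "Languages": ["Python", "Python, PHP, Java", "Java",
                  "Python", "Java", "Python, Javascript"],
})

# Split on ", " (comma plus space) so no name keeps a leading space,
# then give each language its own row.
exploded = df.assign(Language=df["Languages"].str.split(", ")).explode("Language")

# Count rows per (Country, Language) and list the largest counts first.
counts = (exploded.groupby(["Country", "Language"]).size()
          .reset_index(name="Value")
          .sort_values(["Country", "Value"], ascending=[True, False])
          .reset_index(drop=True))
print(counts)
```

Unlike the melted table, this result only contains pairs that actually occur, so there are no zero-count rows to filter out.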