Question

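I am trying to iterate over the Hacker News dataset and create the 3 categories (i.e. post types) found on the HN forum: ask_posts, show_posts and other_posts.

In short, I am trying to find the average number of comments per post for each category (as shown below).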

import pandas as pd
import datetime as dt

df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

ask_posts = []
show_posts = []
other_post = []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row['num_comments']
    if title.lower().startswith('ask hn'):
        ask_posts.append(title)
        for post in ask_posts:
            total_ask_comments += comments
    elif title.lower().startswith('show hn'):
        show_posts.append(title)
        for post in show_posts:
            total_show_comments += comments
    else:
        other_post.append(title)

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)


print(total_ask_comments)
print(total_show_comments)

print(avg_ask_comments)
print(avg_show_comments)

The results are, respectively:

395976587

250362315

and

43328.21829521829

24646.81187241583

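These seem very high, and I am not sure whether that is because of a problem with the way I structured the nested loops. Is this approach correct? It is essential that I use a for loop for this.

Thanks for any help with, or verification of, my code.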

Answer 1

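This post doesn't specifically answer the question about iterating over a dataframe, but it does give you an alternative solution that is faster.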
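
For completeness, here is a minimal sketch of how the loop in the question could be repaired if a for loop really is required. The inner loops (`for post in ask_posts: ...`) add the current row's comment count once for every post collected so far, which is why the totals come out so large; adding the count once per row avoids that. The small dataframe is made-up stand-in data so the snippet runs without the CSV file.

```python
import pandas as pd

# Made-up stand-in for the Hacker News CSV, for illustration only.
df = pd.DataFrame({
    "title": ["Ask HN: How do you learn?", "Show HN: My side project",
              "ask hn: best editor?", "A regular news story"],
    "num_comments": [10, 4, 20, 7],
})

ask_posts, show_posts, other_posts = [], [], []
total_ask_comments = 0
total_show_comments = 0

for _, row in df.iterrows():
    title = row["title"]
    comments = row["num_comments"]
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        ask_posts.append(title)
        total_ask_comments += comments   # added once per row, no inner loop
    elif lowered.startswith("show hn"):
        show_posts.append(title)
        total_show_comments += comments  # added once per row, no inner loop
    else:
        other_posts.append(title)

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments, avg_show_comments)
```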

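Iterating over a Pandas dataframe to collect information as you go will be very slow. It is much faster to use filtering to get the information you want.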

>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
              id  ...       created_at
52      12578335  ...   9/26/2016 0:36
58      12578182  ...   9/26/2016 0:01
64      12578098  ...  9/25/2016 23:44
70      12577991  ...  9/25/2016 23:17
140     12577142  ...  9/25/2016 20:06
...          ...  ...              ...
292995  10177714  ...   9/6/2015 14:21
293002  10177631  ...   9/6/2015 13:50
293019  10177511  ...   9/6/2015 13:02
293028  10177459  ...   9/6/2015 12:38
293037  10177421  ...   9/6/2015 12:16

[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
              id  ...       created_at
10      12578908  ...   9/26/2016 2:53
42      12578522  ...   9/26/2016 1:17
76      12577908  ...  9/25/2016 22:57
80      12577870  ...  9/25/2016 22:48
102     12577647  ...  9/25/2016 21:50
...          ...  ...              ...
293047  10177359  ...   9/6/2015 11:27
293052  10177317  ...   9/6/2015 10:52
293055  10177309  ...   9/6/2015 10:46
293073  10177200  ...    9/6/2015 9:36
293114  10176919  ...    9/6/2015 6:02

[9147 rows x 7 columns]

You can quickly get the numbers this way:

>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>> 
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>> 
>>> # Get a ratio of the number ask comments to total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>>
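
Since the question asks for the average number of comments per post, the same filtered frames can be averaged directly. This is only a sketch on made-up data (the small dataframe below stands in for the CSV). With the figures printed above, the averages would come out to roughly 95000 / 9147 ≈ 10.4 comments per "ask hn" post and 50026 / 10189 ≈ 4.9 per "show hn" post.

```python
import pandas as pd

# Made-up stand-in for the Hacker News CSV, for illustration only.
df = pd.DataFrame({
    "title": ["Ask HN: How do you learn?", "Show HN: My side project",
              "ask hn: best editor?", "A regular news story"],
    "num_comments": [10, 4, 20, 7],
})

ask_posts = df[df["title"].str.contains("ask hn", case=False)]
show_posts = df[df["title"].str.contains("show hn", case=False)]

# mean() is the column sum divided by the number of matching rows.
avg_ask_comments = ask_posts["num_comments"].mean()
avg_show_comments = show_posts["num_comments"].mean()

print(avg_ask_comments, avg_show_comments)
```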

.startswith() and .contains() will also give you different numbers (I'm not sure which one you want).

>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>> 
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>>

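The pattern argument of .contains() can be a regular expression, which is very useful. So we can select all records whose title starts with "ask hn", and if we are not sure whether there is whitespace in front of it, we can do this: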

>>> ask_posts = df[df.title.str.contains(r"^\s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>>
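
To make the difference between the three filters concrete, here is a small sketch on invented titles (none of these come from the real dataset). One title starts with "Ask HN", one has leading spaces, and one only mentions "Ask HN" in the middle.

```python
import pandas as pd

# Invented titles, chosen to show where the three filters disagree.
titles = pd.Series([
    "Ask HN: What are you reading?",
    "  ask hn: title with leading spaces",
    "Why I stopped reading Ask HN threads",
    "Show HN: A tiny tool",
])

# Matches only titles that begin with "ask hn" (case-insensitive).
starts = titles.str.lower().str.startswith("ask hn")

# Matches "ask hn" anywhere in the title.
anywhere = titles.str.contains("ask hn", case=False)

# Matches "ask hn" at the start, allowing leading whitespace.
anchored = titles.str.contains(r"^\s*ask hn", case=False)

print(starts.sum(), anywhere.sum(), anchored.sum())
```

On the real data the regular expression and .startswith() both returned 9139 rows above, which suggests no "ask hn" title in the file has leading whitespace.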

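When you are starting out with Pandas, it can be hard to understand what is going on in the filter statement, for example the expression inside the square brackets of df[df.title.str.contains("show hn", case=False)].

The statement inside the square brackets (df.title.str.contains("show hn", case=False)) produces a column of True and False values: a boolean filter (not sure if that's what it is called, but it has that effect).

The boolean column is then used to select rows of the dataframe, df[<bool column>], which produces a new dataframe containing the matching records. We can then use that to extract other information, such as the sum of the comments column.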
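
Here is a small sketch that splits the filter into its two steps, on made-up data. The pandas documentation calls this technique boolean indexing. The last part, combining two masks to get the "other" posts, is an addition for illustration and assumes the three categories are meant to be mutually exclusive.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    "title": ["Ask HN: How do you learn?", "Show HN: My side project",
              "A regular news story", "show hn: another project"],
    "num_comments": [10, 4, 7, 6],
})

# Step 1: build the boolean column (one True/False per row).
show_mask = df["title"].str.contains("show hn", case=False)
print(show_mask.tolist())

# Step 2: use it to select rows, then work with the result.
show_posts = df[show_mask]
total_show_comments = show_posts["num_comments"].sum()

# Masks can be combined with | (or), & (and) and ~ (not).
ask_mask = df["title"].str.contains("ask hn", case=False)
other_posts = df[~(ask_mask | show_mask)]
```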

Answer 2

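Iterating over a pandas DataFrame object is generally slow. Iteration defeats the whole purpose of using a DataFrame. It is an anti-pattern and something you should only do when you have exhausted every other option. Instead of iterating through the DataFrame, it is better to look for a list comprehension, a vectorized solution, or the DataFrame.apply() method. List comprehension example: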

result = [(x, y, z) for x, y, z in zip(df['column1'], df['column2'], df['column3'])]
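
Applied to the question, a list comprehension could label each post with its type, after which groupby gives the average per category. This is only a sketch: the helper `classify` and the column name `post_type` are made up for illustration, and the small dataframe stands in for the real CSV.

```python
import pandas as pd

# Made-up stand-in for the Hacker News CSV, for illustration only.
df = pd.DataFrame({
    "title": ["Ask HN: How do you learn?", "Show HN: My side project",
              "ask hn: best editor?", "A regular news story"],
    "num_comments": [10, 4, 20, 7],
})


def classify(title):
    """Hypothetical helper: map a title to 'ask', 'show' or 'other'."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The list comprehension replaces the explicit row-by-row loop.
df["post_type"] = [classify(title) for title in df["title"]]

# Average number of comments per post, for each category.
avg_comments = df.groupby("post_type")["num_comments"].mean()
print(avg_comments)
```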
