Common terms between rolling windows of rows

Time: 2017-09-22 07:01:46

Tags: python pandas apache-spark dataframe pyspark

I'm new to pandas and am trying a few things out. Here is the code for my DataFrame:

import pandas as pd

df = pd.DataFrame(['one two three','two three six','two five six','six seven five','five nine'], columns=['Numbers'])

print(df)

Output:

          Numbers
0   one two three
1   two three six
2    two five six
3  six seven five
4       five nine

I want to extract the common terms between every 3 consecutive rows, so the output would be something like this:

          common_Numbers
0          None
1          None
2           two
3           six
4          five

The first and second rows contain None because there aren't yet at least 3 rows. So is there any way to do this with some kind of window operation? I have a large number of rows (> 1M), so looping over every 3 rows is not an option.

Edit: Would it be feasible/efficient to do this in Apache Spark, preferably with PySpark?

3 answers:

Answer 0: (score: 1)

Pandas DataFrames have a rolling method for implementing SQL-like "window functions".
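Since rolling aggregations must reduce each window to a numeric scalar, rolling cannot directly produce the set intersections needed here, but a minimal numeric sketch illustrates the windowing idea:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# A sliding window of 3 rows, analogous to SQL's
# "ROWS BETWEEN 2 PRECEDING AND CURRENT ROW"; the first two
# results are NaN because the window is incomplete
print(s.rolling(window=3).sum())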

If you choose to use Spark (appropriate for larger datasets), you would need to use the Spark SQL API. That is something another question addresses specifically.
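For illustration, here is a rough PySpark sketch of that idea using lag over an ordered window plus array_intersect; it assumes Spark 2.4+ (where array_intersect is available) and an explicit id column to order by:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, 'one two three'), (1, 'two three six'), (2, 'two five six'),
     (3, 'six seven five'), (4, 'five nine')],
    ['id', 'Numbers'])

# An ordered window with no partitionBy pulls all rows into one
# partition; fine for a demo, but a real 1M-row job should partition
w = Window.orderBy('id')
words = F.split(df['Numbers'], ' ')

# lag() is null on the first two rows, so array_intersect propagates
# null there, which matches the desired None output
result = df.select(
    'id',
    F.array_intersect(
        F.array_intersect(words, F.lag(words, 1).over(w)),
        F.lag(words, 2).over(w),
    ).alias('common_Numbers'))
result.show()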

Answer 1: (score: 1)

With pandas, here is one way to achieve what you want:

s1 = df.Numbers.str.split()
s2 = df.Numbers.shift(1).fillna('').str.split()
s3 = df.Numbers.shift(2).fillna('').str.split()
pd.concat([s1, s2, s3], axis=1).apply(
    lambda x: set(x[0]).intersection(set(x[1]).intersection(x[2])),
    axis=1)

Detailed execution:

In [28]: s1 = df.Numbers.str.split() 

In [29]: s1
Out[29]: 
0     [one, two, three]
1     [two, three, six]
2      [two, five, six]
3    [six, seven, five]
4          [five, nine]
Name: Numbers, dtype: object

In [30]: s2 = df.Numbers.shift(1).fillna('').str.split()

In [31]: s2
Out[31]: 
0                    []
1     [one, two, three]
2     [two, three, six]
3      [two, five, six]
4    [six, seven, five]
Name: Numbers, dtype: object

In [32]: s3 = df.Numbers.shift(2).fillna('').str.split()

In [33]: s3
Out[33]: 
0                   []
1                   []
2    [one, two, three]
3    [two, three, six]
4     [two, five, six]
Name: Numbers, dtype: object


In [35]: pd.concat([s1, s2, s3], axis=1)
Out[35]: 
              Numbers             Numbers            Numbers
0   [one, two, three]                  []                 []
1   [two, three, six]   [one, two, three]                 []
2    [two, five, six]   [two, three, six]  [one, two, three]
3  [six, seven, five]    [two, five, six]  [two, three, six]
4        [five, nine]  [six, seven, five]   [two, five, six]

In [36]: pd.concat([s1, s2, s3], axis=1).apply(lambda x: set(x[0]).intersection(set(x[1]).intersection(x[2])), axis=1)
Out[36]: 
0        {}
1        {}
2     {two}
3     {six}
4    {five}
dtype: object
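This yields a Series of sets rather than the scalar-or-None column shown in the question. Assuming any one shared term is acceptable when several exist, a small follow-up step could convert it:

common = pd.concat([s1, s2, s3], axis=1).apply(
    lambda x: set(x[0]).intersection(set(x[1]).intersection(x[2])),
    axis=1)
# Take one element from each set, or None when the set is empty
# (incomplete window, or no shared term)
df['common_Numbers'] = common.apply(lambda s: next(iter(s), None))
print(df)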

Answer 2: (score: 0)

A quick and dirty candidate solution:

from collections import Counter
import pandas as pd

df = pd.DataFrame(['one two three','two three six','two five six','six seven five','five nine'], columns=['commonNumbers'])

def getCommon(dfCommonNumbersColData):
    d = list(dfCommonNumbersColData)  # Materialize the column; may not be memory-efficient
    new_data = [None] * len(d)  # Initialize the result list with None
    for index, row in enumerate(d):
        if index > 1:
            row = row.split(" ")                 # Terms in the current row
            prev_row = d[index - 1].split(" ")   # Terms in the previous row
            prev_prev = d[index - 2].split(" ")  # Terms two rows back
            values = row + prev_row + prev_prev  # Concatenate the three term lists
            # Counter tallies the terms; take the most common one as the
            # shared term (this assumes such a shared term actually exists)
            new_data[index] = Counter(values).most_common()[0][0]
    return new_data  # Return the results

df['Common'] = getCommon(df['commonNumbers'])
df[2:].head(20) # Check the top values for accuracy

You may have to find a way to handle the 1M-plus records, but I think this would be a good starting point.

Obviously, this solution misses some edge cases, but the general approach is there.
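One possible tightening, sketched here as a hypothetical getCommonStrict variant: only accept a term whose count equals the window size, so windows with no shared term yield None (this assumes terms are unique within each row):

from collections import Counter

def getCommonStrict(col_data, window=3):
    d = list(col_data)
    new_data = [None] * len(d)
    for index in range(window - 1, len(d)):
        # Tally terms across the current row and the rows before it
        counts = Counter(t for i in range(window)
                           for t in d[index - i].split())
        term, count = counts.most_common(1)[0]
        if count == window:  # Present in every row of the window
            new_data[index] = term
    return new_data

df['Common'] = getCommonStrict(df['commonNumbers'])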

Pushing these records into a database and querying the streaming data might be a good idea. <- That's a different question, though.