I'm new to pandas and am trying a few things out. Here is the code for my DataFrame:
import pandas as pd
df = pd.DataFrame(['one two three','two three six','two five six','six seven five','five nine'], columns=['Numbers'])
print(df)
Output:
          Numbers
0   one two three
1   two three six
2    two five six
3  six seven five
4       five nine
I want to extract the common term among every 3 consecutive rows, so the output would look like this:
common_Numbers
0 None
1 None
2 two
3 six
4 five
The first and second rows contain None because there are not yet 3 rows available. So is there a way to do this with some kind of window operation? I have a large number of rows (> 1M), so looping over every 3 rows is not an option.
Edit: Would it be feasible/efficient to do this in Apache Spark, preferably PySpark?
Answer 0 (score: 1)
Pandas DataFrames have a rolling method for implementing SQL-like "window functions".
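rolling only aggregates numeric windows, so it will not intersect word sets directly, but a minimal numeric sketch (my own illustration, not from the question) shows the window semantics, including the NaN results for the first two incomplete windows that correspond to the None rows you want:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# rolling(3) covers the current row plus the two before it; the first
# two windows are incomplete, so their result is NaN.
print(s.rolling(3).sum())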
If you choose to use Spark (a good fit for large datasets), you will need to use the Spark SQL API. This is addressed specifically in another question.
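As a rough PySpark sketch of that route (the explicit id column for ordering the window, and array_intersect, which needs Spark 2.4+, are my assumptions rather than anything from the question):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
rows = ['one two three', 'two three six', 'two five six',
        'six seven five', 'five nine']
sdf = spark.createDataFrame(list(enumerate(rows)), ['id', 'Numbers'])

# A window over the whole frame ordered by id; without partitionBy,
# Spark warns that all data is pulled into a single partition.
w = Window.orderBy('id')

words = F.split(F.col('Numbers'), ' ')
# Intersect the current row's words with the two previous rows'.
# lag() yields null for the first two rows, so the result is null there.
common = F.array_intersect(
    F.array_intersect(words, F.lag(words, 1).over(w)),
    F.lag(words, 2).over(w))

sdf.withColumn('common_Numbers', common).show(truncate=False)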
Answer 1 (score: 1)
With pandas, here is one way to achieve what you want:
s1 = df.Numbers.str.split()
s2 = df.Numbers.shift(1).fillna('').str.split()
s3 = df.Numbers.shift(2).fillna('').str.split()
pd.concat([s1, s2, s3], axis=1).apply(
    lambda x: set(x.iloc[0]) & set(x.iloc[1]) & set(x.iloc[2]),
    axis=1)
Detailed execution:
In [28]: s1 = df.Numbers.str.split()
In [29]: s1
Out[29]:
0 [one, two, three]
1 [two, three, six]
2 [two, five, six]
3 [six, seven, five]
4 [five, nine]
Name: Numbers, dtype: object
In [30]: s2 = df.Numbers.shift(1).fillna('').str.split()
In [31]: s2
Out[31]:
0 []
1 [one, two, three]
2 [two, three, six]
3 [two, five, six]
4 [six, seven, five]
Name: Numbers, dtype: object
In [32]: s3 = df.Numbers.shift(2).fillna('').str.split()
In [33]: s3
Out[33]:
0 []
1 []
2 [one, two, three]
3 [two, three, six]
4 [two, five, six]
Name: Numbers, dtype: object
In [35]: pd.concat([s1, s2, s3], axis=1)
Out[35]:
              Numbers             Numbers             Numbers
0   [one, two, three]                  []                  []
1   [two, three, six]   [one, two, three]                  []
2    [two, five, six]   [two, three, six]   [one, two, three]
3  [six, seven, five]    [two, five, six]   [two, three, six]
4        [five, nine]  [six, seven, five]    [two, five, six]
In [36]: pd.concat([s1, s2, s3], axis=1).apply(lambda x: set(x.iloc[0]) & set(x.iloc[1]) & set(x.iloc[2]), axis=1)
Out[36]:
0 {}
1 {}
2 {two}
3 {six}
4 {five}
dtype: object
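If you need the exact common_Numbers column from the question (None instead of empty sets), a small post-processing step works, assuming each window shares at most one term as in the sample data:
common = pd.concat([s1, s2, s3], axis=1).apply(
    lambda x: set(x.iloc[0]) & set(x.iloc[1]) & set(x.iloc[2]), axis=1)
# Unwrap single-element sets; empty sets (rows 0 and 1) become None.
df['common_Numbers'] = common.apply(
    lambda terms: next(iter(terms)) if terms else None)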
Answer 2 (score: 0)
A quick-and-dirty candidate solution:
from collections import Counter
import pandas as pd
df = pd.DataFrame(['one two three','two three six','two five six','six seven five','five nine'], columns=['commonNumbers'])
def getCommon(dfCommonNumbersColData):
    d = list(dfCommonNumbersColData)  # May not be computationally efficient
    new_data = [None] * len(d)        # Initialize the result list
    for index, row in enumerate(d):
        if index > 1:
            words = row.split(" ")               # Current row's terms
            prev_row = d[index - 1].split(" ")   # Previous row
            prev_prev = d[index - 2].split(" ")  # Row before that
            # Combine the three rows and let Counter do a value_counts;
            # grab the most common term from the results
            values = words + prev_row + prev_prev
            new_data[index] = Counter(values).most_common(1)[0][0]
    return new_data  # Return results
df['Common'] = getCommon(df['commonNumbers'])
df[2:].head(20) # Check the top values for accuracy
You may have to work out how to handle the 1M-plus records, but I think this is a good starting point.
Obviously this solution misses some edge cases, but the general approach is there.
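One way to close part of that gap, assuming no word repeats within a single row: only accept the top term when it actually occurs in all three rows.
# Inside the loop, replace the Counter line with a count check:
term, count = Counter(values).most_common(1)[0]
new_data[index] = term if count == 3 else None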
Pushing these records into a database and querying the streaming data might be a good idea. <- That's another question.