PySpark: filtering items in a column of lists

Date: 2020-03-27 10:43:50

Tags: python dataframe filter pyspark

I am trying to filter data in a DataFrame. The DataFrame df has two columns: query + href. In a given row, query is a random string and href is a list of strings. I also have a list of strings named urls.

For every row, I need to find which URLs from the urls list appear inside the href list, and the position of each matching URL within urls. I tried the filter shown below, but PySpark complains about the list. I also can't .collect() because of the amount of data.

Thanks in advance!

Basically it should look something like this, but I am not sure how to do it in PySpark. Example:

df.filter(col("href")).isin(urls)

2 answers:

Answer 0 (score: 0):

The high-level steps:

  1. explode the href column
  2. filter the rows that contain a known URL
  3. collect the results and look up each URL in urls

The code below is split into several small steps to make it easier to inspect the intermediate DataFrames.

Assuming you already have a SparkSession object named ss, we can recreate the original DataFrame like this:

df = ss.createDataFrame(
    [
        ("q1", ["url7", "url11", "url12", "url13", "url14"]),
        ("q2", ["url1", "url3", "url5", "url6"]),
        ("q3", ["url1", "url2", "url8"]),
    ],
    ["query", "href"],
)
urls = ["url1", "url2", "url3", "url4", "url5", "url6", "url7", "url8"]

Now, we apply the previously described steps:

import pyspark.sql.functions as sf

# Exploding the column "href".
exp_df = df.select("query", sf.explode(sf.col("href")).alias("href_sing"))
# Checking if the URL in the DataFrame exists in "urls".
# I suggest converting "urls" into a "set" before this step ("set(urls)"). It might
# improve the performance of "isin", but this is just an optional optimization.
known_df = exp_df.select("*", sf.col("href_sing").isin(urls).alias("is_known"))
# Discard unknown URLs.
true_df = known_df.filter("is_known = True")
# The final results.
res = [
    (r["query"], r["href_sing"], urls.index(r["href_sing"]))
    for r in true_df.collect()
]
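
Checking some values: for this sample data, res ends up with one tuple per known URL, carrying the query, the URL, and its index in urls. The row order returned by collect() is not guaranteed; an illustrative result:

print(res)
# [('q1', 'url7', 6),
#  ('q2', 'url1', 0), ('q2', 'url3', 2), ('q2', 'url5', 4), ('q2', 'url6', 5),
#  ('q3', 'url1', 0), ('q3', 'url2', 1), ('q3', 'url8', 7)]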

Answer 1 (score: 0):

I suggest: 1) making a single-column DataFrame of your urls using explode, 2) using posexplode to make a three-column DataFrame of your query, href, and the index position of each href entry, and then 3) inner joining the two.

  1. Create the urls DataFrame
from pyspark.sql.functions import explode, posexplode

urls = [
    (['url1', 'url2', 'url3', 'url4', 'url5', 'url6', 'url7', 'url8'],),
]
refs = (
    spark.createDataFrame(urls, ['ref']).
        select(
            explode('ref')
        )
)
refs.show(truncate=False)
# +----+
# |col |
# +----+
# |url1|
# |url2|
# |url3|
# |url4|
# |url5|
# |url6|
# |url7|
# |url8|
# +----+
  2. Create the sample data you provided
data = [
    ("q1", ["url7", "url11", "url12", "url13", "url14"]),
    ("q2", ["url1", "url3", "url5", "url6"]),
    ("q3", ["url1", "url2", "url8"]),
]
df = spark.createDataFrame(data, ["query", "href"])
df.show(truncate=False)
# +-----+----------------------------------+
# |query|href                              |
# +-----+----------------------------------+
# |q1   |[url7, url11, url12, url13, url14]|
# |q2   |[url1, url3, url5, url6]          |
# |q3   |[url1, url2, url8]                |
# +-----+----------------------------------+
  3. The solution
(
    df.
        select(
            'query',
            posexplode('href')
        ).
        join(
            refs,
            'col',
            'inner'
        ).
        orderBy('col', 'query').
        show(truncate=False)
)
# +----+-----+---+                                                                
# |col |query|pos|
# +----+-----+---+
# |url1|q2   |0  |
# |url1|q3   |0  |
# |url2|q3   |1  |
# |url3|q2   |1  |
# |url5|q2   |2  |
# |url6|q2   |3  |
# |url7|q1   |0  |
# |url8|q3   |2  |
# +----+-----+---+
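
Note that pos above is the position of each URL within that row's href array. If the question instead wants the position within the urls list, one sketch (reusing the urls and df defined above) is to posexplode the reference list as well, so the index travels with each reference URL:

refs_idx = (
    spark.createDataFrame(urls, ['ref']).
        select(
            posexplode('ref').alias('url_index', 'col')
        )
)
(
    df.
        select('query', explode('href').alias('col')).
        join(refs_idx, 'col', 'inner').
        orderBy('query', 'url_index').
        show(truncate=False)
)
# Expected rows (one per known URL): e.g. url7 in q1 gets url_index 6 and
# url8 in q3 gets url_index 7, matching urls.index(...) from the question.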