Question

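I have two PySpark DataFrames, each consisting of a single column but with different lengths. DataFrame 1 holds ingredient names; DataFrame 2 contains rows of long ingredient strings.

DataFrame 1: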

ingcomb.show(10,truncate=False)
+---------------------------------+
|products                         |
+---------------------------------+
|rebel crunch granola             |
|creamed honey                    |
|mild cheddar with onions & chives|
|berry medley                     |
|sweet relish made with sea salt  |
|spanish peanuts                  |
|stir fry seasoning mix           |
|swiss all natural cheese         |
|yellow corn meal                 |
|shredded wheat                   |
+---------------------------------+
only showing top 10 rows

DataFrame 2:

reging.show(10, truncate=30)
+------------------------------+
|                   ingredients|
+------------------------------+
|apple bean cookie fruit kid...|
|bake bastille day bon appét...|
|dairy fennel gourmet new yo...|
|bon appétit dairy free dinn...|
|bake bon appétit california...|
|bacon basil bon appétit foo...|
|asparagus boil bon appétit ...|
|cocktail party egg fruit go...|
|beef ginger gourmet quick &...|
|dairy free gourmet ham lunc...|
+------------------------------+
only showing top 10 rows

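I need to create a loop (any other suggestions are welcome too!) that iterates over DataFrame 1, compares each value against the strings in DataFrame 2 using "like", and returns the total number of matches.

Desired result: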

+--------------------+-----+
|         ingredients|count|
+--------------------+-----+
|rebel crunch granola|  183|
|creamed honey       |   87|
|berry medley        |   67|
|spanish peanuts     |   10|
+--------------------+-----+

I know the following code works:

reging.filter("ingredients like '%sugar%'").count()

and I tried to implement something like

for i in ingcomb:
    x = reging.select("ingredients").filter("ingredients like '%i%'").count()

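but I cannot get PySpark to treat the "i" from ingcomb as a value rather than the literal character i.

I tried the solutions in "Spark Compare two dataframe and find the match count", but unfortunately they did not work. I am running this on GCP, and calling toPandas fails with an error: pandas cannot be installed because of permissions.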

Answer 1

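We were actually able to solve this ourselves: we first get the counts in a DataFrame and then match them with a join. Feel free to suggest something better; I am new to coding.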

