Question

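I'm currently trying to solve a problem where I have a large text string (a summary) and am searching that summary for certain words. Based on the presence of any one of several words in a category, I'd like to create an array of the corresponding tags, as follows: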

ground = ['car', 'motorbike']
air = ['plane']
colour = ['blue', 'red']

| Summary                | Tag_Array            |
|------------------------|----------------------|
| This is a blue car     | ['ground', 'colour'] |
| This is red motorbike  | ['ground', 'colour'] |
| This is a plane        | ['air']              |

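The idea is to read each summary and then create an array in the Tag_Array column containing the tags associated with the summary text. The ground tag can be triggered by any number of candidate words; in this case either motorbike or car causes the ground tag to be returned.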

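Functionally I have something that works, but it is a very poor approach and extremely verbose, so I'm looking for the most appropriate way to achieve this in PySpark.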

    import pyspark.sql.functions as f

    df = (df
        .withColumn("summary_as_array", f.split('summary', " "))
        .withColumn("tag_array", f.array(
            f.when(f.array_contains('summary_as_array', "car"), "ground").otherwise(""),
            f.when(f.array_contains('summary_as_array', "motorbike"), "ground").otherwise("")
            )
        )
    )

Answer 1

If you can convert your lists into key-value pairs like this,

tagDict = {'ground':['car', 'motorbike'],'air':['plane'],'colour':['blue','red']}

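then we can create a UDF that iterates over the words in the summary and over the dictionary values to find the matching key, which becomes the tag. A simple solution: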

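from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

l = [('This is a blue car',), ('This is red motorbike',), ('This is a plane',)]
df = spark.createDataFrame(l, ['summary'])

# Declare the return type explicitly; F.udf defaults to StringType.
tag_udf = F.udf(lambda x: [k for k, v in tagDict.items() if any(itm in x for itm in v)],
                ArrayType(StringType()))
df = df.withColumn('tag_array', tag_udf(df['summary']))
df.show(truncate=False)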
+---------------------+----------------+
|summary              |tag_array       |
+---------------------+----------------+
|This is a blue car   |[colour, ground]|
|This is red motorbike|[colour, ground]|
|This is a plane      |[air]           |
+---------------------+----------------+
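
One thing to watch in the lambda above: `itm in x` is a substring test against the whole summary string, so a word such as "scar" or "carpet" would also trigger the ground tag. Below is a minimal sketch of a word-level variant, written in plain Python so it can be checked without a cluster. The function name `tags_for`, the lower-casing, and the punctuation stripping are assumptions added for illustration, not part of the original answer.

```python
import string

tagDict = {'ground': ['car', 'motorbike'], 'air': ['plane'], 'colour': ['blue', 'red']}

def tags_for(summary, tag_dict=tagDict):
    """Return the tags whose keyword list shares a whole word with the summary."""
    if summary is None:  # Spark passes None for null cells
        return []
    # Lower-case, split on whitespace, and strip surrounding punctuation.
    words = {w.strip(string.punctuation) for w in summary.lower().split()}
    # Keep a tag when at least one of its keywords appears as a whole word.
    return [tag for tag, keywords in tag_dict.items() if words.intersection(keywords)]
```

If this behaviour is what you want, it could be registered in place of the lambda with `F.udf(tags_for, ArrayType(StringType()))`.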

Hope this helps.
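
As an alternative that avoids a Python UDF altogether, the same tagging can probably be expressed with Spark's built-in array functions, which keeps the work inside the JVM. The sketch below assumes Spark 2.4 or later (for `arrays_overlap` and `array_remove`) and matches whole, lower-cased, whitespace-separated words; the helper name `add_tag_array` and its parameters are hypothetical.

```python
tagDict = {'ground': ['car', 'motorbike'], 'air': ['plane'], 'colour': ['blue', 'red']}

def add_tag_array(df, tag_dict=tagDict, text_col="summary", out_col="tag_array"):
    """Add an array column of tags using Spark built-ins only (no Python UDF)."""
    # Imported here so the snippet can be loaded where PySpark is not installed.
    from pyspark.sql import functions as F

    # Split the lower-cased summary into an array of words.
    words = F.split(F.lower(F.col(text_col)), r"\s+")

    # One expression per tag: the tag name if any keyword is present, else "".
    per_tag = [
        F.when(
            F.arrays_overlap(words, F.array(*[F.lit(k) for k in keywords])),
            F.lit(tag),
        ).otherwise(F.lit(""))
        for tag, keywords in tag_dict.items()
    ]

    # Collect the per-tag results and drop the empty placeholders.
    return df.withColumn(out_col, F.array_remove(F.array(*per_tag), ""))
```

Usage would be `df = add_tag_array(df)`. Unlike the approach in the question, adding a category only means adding an entry to the dictionary.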
