Question

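I have two text files:

One of them contains translations/aliases in the following form:

123 456
2 278
456 99999
...

and the other has three entries per line: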

34 456 9900
111 333 444
234 2 562
...

I would like to translate the second column wherever a translation exists, so that the output DataFrame has the following rows:

34, 99999, 9900
111, 333, 444
234, 278, 562

Reading the text files works fine. Translating column b, however, is the problem. This is my basic code structure so far:

translation = sc.textFile("transl.txt")\
    .map(lambda line: line.split(" "))

def translate(string):
    x = translation.filter(lambda x: x[0] == string).collect()
    if x == []:
        return string
    return x[0][1]

d = sc.textFile("text.txt")\
    .map(lambda line: line.split(" "))\
    .toDF(["a", "b", "c"])\
    .withColumn("b", translate(d.b))\

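Everything works fine up to the last line. I know that applying a function to a column in Spark is not that easy, but I don't know how else to do it.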

Answer 1

If you import both files as DataFrames, there is a slightly different way to do this: join them. Here is an example using pandas:

# Sample DataFrames built from the provided example
import pandas as pd
translations = pd.DataFrame({
    'Key': [123,2,456],
    'Translation': [456,278,99999]
    })  

entries = pd.DataFrame({
    'A': [34,11,234],
    'B': [456,333,2],
    'C': [9900,444,562]
    })

Once the files are imported, we can merge them on the lookup key with a left join:

df = pd.merge(entries, translations, left_on='B', right_on='Key', how='left')

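However, this leaves NaN in the 'Translation' column wherever no lookup entry was found. To fix this, we overwrite the original 'B' column with the lookup value and fall back to the existing value of 'B' where the lookup is missing: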

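# The NaN values make 'Translation' a float column, so cast back to integers after filling
df['B'] = df['Translation'].mask(pd.isna, df['B']).astype('int64')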
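The fallback step can also be written with `Series.fillna`, which some readers may find easier to follow than `mask`. The snippet below is only a small sketch on a hand-built frame that mimics the merged result; the variable names `merged` and `filled` are made up for this illustration.

```python
import pandas as pd

# Hand-built stand-in for the merged frame: NaN marks rows without a match.
merged = pd.DataFrame({
    "B": [456, 333, 2],
    "Translation": [99999.0, float("nan"), 278.0],
})

# Take the translation where present, otherwise fall back to the value in B.
filled = merged["Translation"].fillna(merged["B"])

# The NaN forced the column to float, so cast back once the gaps are filled.
merged["B"] = filled.astype("int64")
print(merged["B"].tolist())
```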

Now we need to drop the extra columns to get the result you asked for:

# drop() returns a new DataFrame, so assign the result back
df = df.drop(columns=['Key', 'Translation'])

df will now look like this:

    A   B       C
0   34  99999   9900
1   11  333     444
2   234 278     562
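If the translation table is small, the same result can be sketched without a merge by turning it into a plain dict and mapping column B through it. This is an illustration only: it assumes the table fits in memory and has unique keys, and the helper name `translate_column` is invented for this example.

```python
import pandas as pd

def translate_column(entries, translations, column="B"):
    """Return a copy of `entries` with `column` translated where a key exists."""
    # Build a plain dict {Key: Translation}; assumes the keys are unique.
    mapping = dict(zip(translations["Key"].tolist(),
                       translations["Translation"].tolist()))
    result = entries.copy()
    # dict.get(v, v) keeps the original value when there is no translation,
    # so no NaN is introduced and the column stays integer-typed.
    result[column] = result[column].map(lambda v: mapping.get(v, v))
    return result

translations = pd.DataFrame({"Key": [123, 2, 456],
                             "Translation": [456, 278, 99999]})
entries = pd.DataFrame({"A": [34, 111, 234],
                        "B": [456, 333, 2],
                        "C": [9900, 444, 562]})

translated = translate_column(entries, translations)
print(translated)
```

Because the lookup happens row by row in Python, this is likely slower than the merge on large frames; it mainly avoids the intermediate NaN handling.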

Answer 2

You can achieve this with a left join. Have a look at the commented code below:

import pyspark.sql.functions as F

l1 = [
(123, 456)
,(2, 278)
,(456, 99999)
]

l2 = [
(34, 456, 9900)
,(111, 333, 444)
,(234, 2, 562)
]

df1=spark.createDataFrame(l1, ['one1', 'two1'])
df2=spark.createDataFrame(l2, ['one2', 'two2', 'three2'])

# creates a DataFrame with five columns: one2, two2, three2, one1, two1
df = df2.join(df1, df2.two2 == df1.one1 , 'left')

# checks whether a translation is available in the lookup DataFrame;
# if not, the current value is kept, otherwise the value is translated
df = df.withColumn('two2', F.when(F.col('two1').isNull(), F.col('two2') ).otherwise(F.col('two1')))

df = df.drop('one1', 'two1')

df.show()

Output:

+----+-----+------+
|one2| two2|three2|
+----+-----+------+
| 111|  333|   444|
| 234|  278|   562|
|  34|99999|  9900|
+----+-----+------+

Translating a DataFrame column using a second DataFrame