How to run pyspark code in a distributed environment

Asked: 2017-09-17 11:53:04

Tags: python apache-spark pyspark

I have 1 million records and I want to try Spark on them. I have a list of terms, and I want to use it to perform a lookup against the records.

l = ['domestic', 'private']
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
    for word in l:
        if word in rec:
            res[rec] = 1  # flag the record if it contains any search term
            break
print(res)

This is a simple Python script, and I want to run the same logic with pyspark in a distributed way (would this same code work?) to reduce execution time.
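For reference, the nested loop above collapses into a single regular-expression test per record (escaping the terms so they match literally), which is the same column-wise operation a distributed engine can apply. A minimal pure-Python sketch with a shortened, illustrative `text` list:

```python
import re

l = ['domestic', 'private']
text = ["On the domestic front, growth seems to have stalled.",
        "Rupee continues to remain strong against the USD."]

# Build one alternation pattern from the search terms
pattern = re.compile('|'.join(re.escape(word) for word in l))

# Flag each record that contains any of the terms
res = {rec: 1 for rec in text if pattern.search(rec)}
print(res)
```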

Could you guide me on how to do this? Apologies, as I am very new to this; any help would be greatly appreciated.

1 Answer:

Answer 0 (score: 1)

After instantiating a Spark context and/or Spark session, you have to convert your list of records into a DataFrame:

df = spark.createDataFrame(
    sc.parallelize(
        [[rec] for rec in text]
    ), 
    ["text"]
)
df.show()

    +--------------------+
    |                text|
    +--------------------+
    |On the domestic f...|
    |Despite the afore...|
    |This raises the q...|
    +--------------------+

Now you can check, for each row, whether any of the words in l are present:

sc.broadcast(l)
res = df.withColumn("res", df.text.rlike('|'.join(l)).cast("int"))
res.show()

    +--------------------+---+
    |                text|res|
    +--------------------+---+
    |On the domestic f...|  1|
    |Despite the afore...|  0|
    |This raises the q...|  0|
    +--------------------+---+
  • rlike performs a regular-expression match against the pattern built by '|'.join(l)
  • sc.broadcast copies the object l to every node, so executors do not have to fetch it from the driver

Hope this helps.