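I want to replace multiple strings in a PySpark RDD. The strings should be replaced in order of length, from longest to shortest. The operation will eventually run over a large amount of text, so good performance matters.
Example of the problem:
In the example below, I want to replace the strings:
replace, text, is
with, in the same order (longest to shortest):
replacement1, replacement2, replacement3
That is, if the string replace is found, it should be replaced with replacement1, and in this example that string is searched for and replaced first.
The strings will also be stored as a PySpark RDD, as follows: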
+---------+------------------+
| string | replacement_term |
+---------+------------------+
| replace | replacement1 |
+---------+------------------+
| text | replacement2 |
+---------+------------------+
| is | replacement3 |
+---------+------------------+
Here is an example of the RDD whose text needs to be replaced using the terms above:
+----+-----------------------------------------+
| id | text |
+----+-----------------------------------------+
| 1 | here is some text to replace with terms |
+----+-----------------------------------------+
| 2 | text to replace with terms |
+----+-----------------------------------------+
| 3 | text |
+----+-----------------------------------------+
| 4 | here is some text to replace |
+----+-----------------------------------------+
| 5 | text to replace |
+----+-----------------------------------------+
I want to perform the replacement and produce an output RDD like this:
+----+----------------------------------------------------------------+
| id | text |
+----+----------------------------------------------------------------+
| 1 | here replacement3 some replacement2 to replacement1 with terms |
+----+----------------------------------------------------------------+
| 2 | replacement2 to replacement1 with terms |
+----+----------------------------------------------------------------+
| 3 | replacement2 |
+----+----------------------------------------------------------------+
| 4 | here replacement3 some replacement2 to replacement1 |
+----+----------------------------------------------------------------+
| 5 | replacement2 to replacement1 |
+----+----------------------------------------------------------------+
Thanks for the help.
Answer 0 (score: 1)
The following snippet works with Spark / Scala and the DataFrame API. Try adapting it to RDD and PySpark.
// imports
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// spark-session (not needed if you're in spark-shell)
implicit val spark: SparkSession = SparkSession.builder().appName("SO").getOrCreate()
// you'll be reading it from somewhere
val dfToBeModified: DataFrame = spark.createDataFrame(
rowRDD = spark.sparkContext.parallelize(List(
Row(1, "here is some text to replace with terms"),
Row(2, "text to replace with terms"),
Row(3, "text"),
Row(4, "here is some text to replace"),
Row(5, "text to replace")
)),
schema = StructType(List(
StructField("id", IntegerType, false),
StructField("text", StringType, false)
))
)
// it should preferably be read not as a dataframe but as a sequence
val dfWithReplacements: DataFrame = spark.createDataFrame(
rowRDD = spark.sparkContext.parallelize(List(
Row("replace", "replacement1"),
Row("text", "replacement2"),
Row("is", "replacement3")
)),
schema = StructType(List(
StructField("string", StringType, false),
StructField("replacement_term", StringType, false)
))
)
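// collect() pulls every row to the driver, so dfWithReplacements must stay small
// sort longest-first so that longer terms are replaced before shorter ones
val seqWithReplacements: Array[Row] = dfWithReplacements.collect().sortBy(row => -row.getString(0).length)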
// there you go
val dfWithModifications: DataFrame = seqWithReplacements.foldLeft(dfToBeModified) { (dfWithSomeModifications: DataFrame, row: Row) =>
dfWithSomeModifications.withColumn("text", regexp_replace(dfWithSomeModifications("text"), row(0).toString, row(1).toString))
}
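As a rough starting point for the PySpark adaptation, the same fold can be sketched in plain Python. This is only a sketch under assumptions: `replace_terms`, `terms`, and `ordered_terms` are hypothetical names, the term list is assumed small enough to hold in memory, and `str.replace` matches literally, whereas `regexp_replace` above treats each term as a regular expression.

```python
from functools import reduce

# Hypothetical in-memory copy of the replacement table (assumed to be small).
terms = [("is", "replacement3"), ("replace", "replacement1"), ("text", "replacement2")]

# Longest search string first, as the question asks.
ordered_terms = sorted(terms, key=lambda pair: len(pair[0]), reverse=True)

def replace_terms(text, pairs=ordered_terms):
    """Apply each (search, replacement) pair in order, literally (no regex)."""
    return reduce(lambda acc, pair: acc.replace(pair[0], pair[1]), pairs, text)

print(replace_terms("here is some text to replace with terms"))
```

On an RDD of (id, text) pairs, a function like this could be passed to `mapValues`. Because the replacements run one after another, a later term can also match inside text inserted by an earlier replacement, so the term list needs to be checked for that kind of overlap.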
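Answer 1 (score: 1)
So, assuming you cannot collect the replacement-terms RDD, and also assuming each term to replace is a single word:
First, you flatten the text into words (remembering the word order).
Then you do a left join to replace the words.
Then you reassemble the original text.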
replacement_terms_rdd = sc.parallelize([("replace", "replacement1"),
("text", "replacement2"),
("is", "replacement3")])
text_rdd = sc.parallelize([(1, "here is some text to replace with terms"),
(2, "text to replace with terms "),
(3, "text"),
(4, "here is some text to replace"),
(5, "text to replace")])
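result = (text_rdd
    # flatten: emit one (word, (id, position)) pair per word
    .flatMap(lambda x: [(word, (x[0], pos)) for pos, word in enumerate(x[1].split())])
    # left join on the word; words without a replacement get None
    .leftOuterJoin(replacement_terms_rdd)
    # re-key by id, keeping (position, replacement or original word)
    .map(lambda x: (x[1][0][0], (x[1][0][1], x[1][1] or x[0])))
    # regroup by id and rebuild the text in word order
    .groupByKey().mapValues(lambda x: " ".join([y[1] for y in sorted(x)]))
    .collect())
print(result)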
Result:
[(1, 'here replacement3 some replacement2 to replacement1 with terms'), (2, 'replacement2 to replacement1 with terms'), (3, 'replacement2'), (4, 'here replacement3 some replacement2 to replacement1'), (5, 'replacement2 to replacement1')]
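To make the intermediate shapes easier to follow, the three steps can be traced in plain Python without Spark. This is only an illustrative sketch: the dictionary `replacements` stands in for the left join, and the names `rows`, `flattened`, `joined`, and `rebuilt` are hypothetical.

```python
# Plain-Python trace of the flatten / left-join / regroup steps (no Spark needed).
replacements = {"replace": "replacement1", "text": "replacement2", "is": "replacement3"}
rows = [(1, "here is some text to replace with terms"), (3, "text")]

# Step 1: flatten to (word, (id, position)) so word order can be restored later.
flattened = [(word, (row_id, pos))
             for row_id, text in rows
             for pos, word in enumerate(text.split())]

# Step 2: "left join" on the word; dict.get returns None for unmatched words,
# in which case the original word is kept.
joined = [(row_id, (pos, replacements.get(word) or word))
          for word, (row_id, pos) in flattened]

# Step 3: group by id, sort by position, and join the words back together.
grouped = {}
for row_id, pos_word in joined:
    grouped.setdefault(row_id, []).append(pos_word)
rebuilt = {row_id: " ".join(word for _, word in sorted(words))
           for row_id, words in grouped.items()}

print(rebuilt)
```

Since the text is split on whitespace, only whole words are matched here, which is why this approach relies on each term being a single word.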