I created 3 DataFrames by running the code below.
sample.csv
id|code|name|Lname|mname
2|AA|BB|CC|DD|
sample1.csv
id|code|name|Lname|mname
1|A|B|C|D|
sample2.csv
id1|code1|name1|Lnam|mnam
3|AAA|BBB|CCC|DDD|
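I have to compare the headers of the DataFrames using fuzzy matching. If the average match score across all header columns of two files (sample1, sample2) is 85%, I have to print that the two files are the same.
Example: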
sample1.csv vs sample2.csv
+--------+--------+-------+
| f1_lab | f2_lab | score |
+--------+--------+-------+
| id     | id1    | 80    |
| code   | code1  | 89    |
| name   | name1  | 89    |
| Lname  | Lnam   | 89    |
| mname  | mnam   | 89    |
+--------+--------+-------+
My final output will be the average score, e.g. (80+89+89+89+89)/5 = 87.2.
If the average score is above 80, the output must print "sample1 and sample2 matched".
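The averaging rule described above can be sketched as follows. This is only an illustration under stated assumptions: it pairs headers by position, as in the example table, and uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (assumed here to give comparable 0-100 scores). The names `ratio`, `header_scores` and `THRESHOLD` are hypothetical and not part of the original code.

```python
from difflib import SequenceMatcher

THRESHOLD = 80  # assumed cut-off, taken from the "above 80" rule in the question


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def header_scores(headers1, headers2):
    """Score headers pairwise by position and return (rows, average score)."""
    rows = [(h1, h2, ratio(h1, h2)) for h1, h2 in zip(headers1, headers2)]
    average = sum(score for _, _, score in rows) / len(rows)
    return rows, average


# Headers of sample1.csv and sample2.csv from the question.
sample1 = ["id", "code", "name", "Lname", "mname"]
sample2 = ["id1", "code1", "name1", "Lnam", "mnam"]

rows, average = header_scores(sample1, sample2)
for f1_lab, f2_lab, score in rows:
    print(f1_lab, f2_lab, score)
print("average score:", average)
if average > THRESHOLD:
    print("sample1 and sample2 matched")
```

Pairing by position only works when both files list their columns in the same order; if the order can differ, each header would have to be matched against its best candidate in the other file instead.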
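In the same way, I have to compare the headers of every file with the headers of every other file. I need to identify all matching files.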
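One possible way to compare every file with every other file is to iterate over all pairs with `itertools.combinations`. The sketch below is an assumption-laden illustration, not a definitive solution: the header lists are typed in by hand (in PySpark they could come from `df.columns`), `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and `average_best_match`, `matching_files` and `headers_by_file` are hypothetical names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    """0-100 similarity; a standard-library stand-in for fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def average_best_match(headers1, headers2):
    """Average, over headers1, of each header's best score against headers2."""
    best = [max(ratio(h1, h2) for h2 in headers2) for h1 in headers1]
    return sum(best) / len(best)


def matching_files(headers_by_file, threshold=80):
    """Return (file_a, file_b, average) for every pair scoring above threshold."""
    matches = []
    for (f1, h1), (f2, h2) in combinations(headers_by_file.items(), 2):
        average = average_best_match(h1, h2)
        if average > threshold:
            matches.append((f1, f2, average))
    return matches


# Headers copied from the three sample files in the question.
headers_by_file = {
    "sample.csv": ["id", "code", "name", "Lname", "mname"],
    "sample1.csv": ["id", "code", "name", "Lname", "mname"],
    "sample2.csv": ["id1", "code1", "name1", "Lnam", "mnam"],
}

matches = matching_files(headers_by_file)
for f1, f2, average in matches:
    print(f"{f1} and {f2} matched (average score {average})")
```

Note that the best-match average is taken from the first file's point of view, so it is not guaranteed to be symmetric when the two files have different numbers of columns.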
Please help me.
Please find the code below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ALS").getOrCreate()
sc = spark.sparkContext

# Read each pipe-delimited file with a header row; all columns stay strings
df = spark.read.csv('C:/Users/test/Desktop/sample1.csv', sep='|', header=True, inferSchema=False)
df1 = spark.read.csv('C:/Users/test/Desktop/sample2.csv', sep='|', header=True, inferSchema=False)
df2 = spark.read.csv('C:/Users/test/Desktop/sample3.csv', sep='|', header=True, inferSchema=False)
# Column names (headers) of each DataFrame
lab = list(df.columns)
lab1 = list(df1.columns)
lab2 = list(df2.columns)

# Pair each header with its position and store the result as a small DataFrame
labhead = sc.parallelize(lab).zipWithIndex()
a = spark.createDataFrame(labhead, ['label1', "Index"])
lab1head = sc.parallelize(lab1).zipWithIndex()
a1 = spark.createDataFrame(lab1head, ['label2', "Index"])
lab2head = sc.parallelize(lab2).zipWithIndex()
a2 = spark.createDataFrame(lab2head, ['label3', "Index"])
from fuzzywuzzy import fuzz

def match_name(name, list_names, min_score=0):
    max_score = -1
    # Return an empty name when there is no match
    max_name = ""
    # Iterate over all names in the other list
    for name2 in list_names:
        # Find the fuzzy match score
        score = fuzz.ratio(name, name2)
        # Check that we are above the threshold and have a better score
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
    return (max_name, max_score)
dict_list = []
# Iterate over the plain Python header lists; a Spark Column such as a.label1 is not iterable
for name in lab:
    # Use our method to find the best match; we can set a threshold here
    match = match_name(name, lab1, 75)
    dict_list.append({"labhead": name, "labhead1": match[0], "score": match[1]})
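Once `dict_list` has been filled, the remaining step would be to average the scores and print the verdict. A minimal sketch, assuming each entry has the `score` key used above and that the cut-off is 80; `summarize` is a hypothetical helper, and the rows below are typed in from the example table rather than produced by `match_name`.

```python
def summarize(dict_list, threshold=80):
    """Return (average score, matched?) for a list of per-header match dicts."""
    scores = [row["score"] for row in dict_list]
    # Guard against an empty list so the division cannot fail
    average = sum(scores) / len(scores) if scores else 0.0
    return average, average > threshold


# Rows shaped like the entries appended to dict_list, with the example scores
dict_list = [
    {"labhead": "id", "labhead1": "id1", "score": 80},
    {"labhead": "code", "labhead1": "code1", "score": 89},
    {"labhead": "name", "labhead1": "name1", "score": 89},
    {"labhead": "Lname", "labhead1": "Lnam", "score": 89},
    {"labhead": "mname", "labhead1": "mnam", "score": 89},
]

average, matched = summarize(dict_list)
print("average score:", average)
if matched:
    print("sample1 and sample2 matched")
else:
    print("sample1 and sample2 did not match")
```

One thing to keep in mind: when no candidate beats `min_score`, `match_name` returns a score of -1, so unmatched headers pull the average down.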