How do I compare two strings in large pandas DataFrames (Python 3.x)?

Time: 2019-04-23 10:48:03

Tags: python python-3.x pandas

I have two DataFrames loaded from two Excel files.

First file (awcProjectMaster) (1,500 records)

projectCode    projectName
  100101       kupwara
  100102       kalaroos
  100103       tangdar

Second file (village master file) (more than 10 million records)

villageCode    villageName
   425638          wara
   783651          tangdur
   986321          kalaroo

I need to compare projectName with villageName and compute a match percentage. The following code works, but it is very slow. How can I do the same thing more efficiently?

import pandas as pd
from datetime import datetime

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")


def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    with open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a") as f:
        percentMatch = 0
        vLen = len(vName)
        prjLen = len(prjName)
        if vLen > prjLen:
            if vName.find(prjName) != -1:
                percentMatch = (prjLen / vLen) * 100
                f.write(prjCode + "," + prjName + "," + vCode + "," + vName + "," + str(round(percentMatch)) + "," + stCode + "," + stName + "," + dCode + "," + dName + sdCode + "," + sdName + "\n")
            else:
                res = 0
                # print(res)
        elif prjLen >= vLen:
            if prjName.find(vName) != -1:
                percentMatch = (vLen / prjLen) * 100
                f.write(prjCode + "," + prjName + "," + vCode + "," + vName + "," + str(round(percentMatch)) + "," + stCode + "," + stName + "," + dCode + "," + dName + sdCode + "," + sdName + "\n")
            else:
                res = 0
                # print(res)
    f.close()


for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )

1 Answer:

Answer 0: (score: 1)

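Your distance measure for strings isn't very accurate, but if it works for you, that's fine. (You may want to look into other options, though, such as the built-in difflib module or the Python-Levenshtein module.)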
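For comparison, here is a minimal sketch of the difflib option mentioned above. The helper name similarity_percent is made up for illustration; difflib.SequenceMatcher and its ratio() method are part of the standard library. Unlike a substring check, it gives a nonzero score to near matches such as "tangdar" / "tangdur".

```python
from difflib import SequenceMatcher


def similarity_percent(a, b):
    """Return a 0-100 similarity score for two strings (hypothetical helper)."""
    # ratio() is 2 * (matching characters) / (total length), a float in [0, 1].
    return SequenceMatcher(None, a, b).ratio() * 100


# Sample names taken from the question.
pairs = [("kupwara", "wara"), ("tangdar", "tangdur"), ("kalaroos", "kalaroo")]
for project, village in pairs:
    print(project, village, round(similarity_percent(project, village)))
```

Whether this score suits the data better than the substring percentage is a judgment call; it is also noticeably more expensive per pair, so it makes the total running time worse, not better.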

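If you really do need to compare 1,500 x 10,000,000 records pairwise, it is bound to take some time, but there are a couple of easy things we can do to speed it up:

  • Open the log file only once; opening it has some overhead, which can be significant.
  • Refactor the comparison into its own function, then apply the lru_cache() memoization decorator so that each pair is only compared once and later results are served from an in-memory cache. (Also note how the vName / prjName pair is sorted: since the order of the two strings does not matter, the cache ends up needing only half as many entries.)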
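The effect of sorting the pair before the cached call can be seen in a small, self-contained sketch. The names overlap_percent and cached_overlap are hypothetical, and maxsize=None (an unbounded cache) is chosen here only to keep the demonstration simple.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded here; the default keeps only 128 entries
def overlap_percent(a, b):
    """Substring-based match percentage, or None if neither contains the other."""
    shorter, longer = sorted((a, b), key=len)
    if shorter in longer:
        return len(shorter) / len(longer) * 100
    return None


def cached_overlap(name1, name2):
    # Sort the pair so that both argument orders share a single cache entry.
    first, second = sorted((name1, name2))
    return overlap_percent(first, second)


print(cached_overlap("kupwara", "wara"))
print(cached_overlap("wara", "kupwara"))  # served from the cache
print(overlap_percent.cache_info())
```

Keep in mind that the cache only pays off when the same pair of names actually recurs, and that an unbounded cache grows with every distinct pair it sees.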

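Then, some general cleanup:

  • Use the csv module to stream CSV into the file (the output format differs slightly from your code, but you can change that with the dialect parameter to csv.writer()).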
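As a rough illustration of adjusting the output format, csv.writer() also accepts individual formatting parameters such as delimiter and lineterminator as keyword arguments. In this sketch an in-memory buffer stands in for the real log file, and the second row is invented purely to show how quoting behaves.

```python
import csv
import io

buffer = io.StringIO()  # stands in for the real log file
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")

# A row shaped like the log output (codes and names from the question).
writer.writerow(["100101", "kupwara", "425638", "wara", 57])
# Made-up row: a field containing the delimiter is quoted automatically.
writer.writerow(["100103", "name; with delimiter", "783651", "tangdur", 0])

print(buffer.getvalue())
```

This automatic quoting is the main practical advantage over joining fields by hand with "," as in the question's code.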

Hope this helps!

import csv
from functools import lru_cache

import pandas as pd

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")

# Open the log file once, up front; newline="" is what the csv module expects.
log_file = open(r"C:\Users\Desktop\prjToVillageStateWise\stCodeVillage1To6.txt", "a", newline="")
log_writer = csv.writer(log_file)


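# Memoize results. The default maxsize is 128 entries; pass a larger
# maxsize (or None for unbounded) if repeated pairs are far apart.
@lru_cache()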
def compare_vname_prjname(vName, prjName):
    vLen = len(vName)
    prjLen = len(prjName)
    if vLen > prjLen:
        if vName.find(prjName) != -1:
            return (prjLen / vLen) * 100
    elif prjLen >= vLen:
        if prjName.find(vName) != -1:
            return (vLen / prjLen) * 100
    return None


def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    # Help the cache decorator out by halving the number of possible pairs.
    # Sort into separate variables so the logged prjName / vName are not swapped.
    first, second = sorted([vName, prjName])
    percent_match = compare_vname_prjname(first, second)
    if percent_match is None:  # No match
        return False
    log_writer.writerow(
        [
            prjCode,
            prjName,
            vCode,
            vName,
            round(percent_match),
            stCode,
            stName,
            dCode,
            dName + sdCode,  # concatenated as in the question's code (possibly a missing comma there)
            sdName,
        ]
    )
    return True


for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )

# Flush and close the log once all pairs have been processed.
log_file.close()
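As a further, optional sketch that is not part of the original answer: iterrows() builds a Series object for every row, which is costly inside a nested loop. Assuming the code and name columns can be pulled out into plain Python lists once (for example with df1["VillageNameInEnglish"].str.lower().tolist()), the pairwise loop can run over tuples instead. The function name find_matches and the tuple layout are assumptions made for illustration.

```python
def find_matches(projects, villages):
    """Yield (project_code, project_name, village_code, village_name, percent).

    Both arguments are lists of (code, lowercase_name) tuples.
    """
    for prj_code, prj_name in projects:
        for v_code, v_name in villages:
            shorter, longer = sorted((prj_name, v_name), key=len)
            # Skip empty names; otherwise apply the same substring rule as above.
            if shorter and shorter in longer:
                percent = round(len(shorter) / len(longer) * 100)
                yield prj_code, prj_name, v_code, v_name, percent


# Sample rows from the question.
projects = [("100101", "kupwara"), ("100102", "kalaroos"), ("100103", "tangdar")]
villages = [("425638", "wara"), ("783651", "tangdur"), ("986321", "kalaroo")]

for match in find_matches(projects, villages):
    print(match)
```

This still performs every pairwise comparison, so it reduces per-iteration overhead rather than the overall amount of work.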