Question

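I have a lookup table of scientific names for plants. I want to use this lookup table to validate other tables that are filled in by data entry staff. Sometimes they get the formatting of these scientific names wrong, so I am writing a script to flag the errors.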

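There is a very specific way to format each name. For example, 'Sonchus arvensis L.' needs the S in Sonchus capitalized, as well as the L at the end. I have about 1000 different plants, and each one is formatted differently. Here are a few more examples: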

Linaria dalmatica (L.) Mill.
Knautia arvensis (L.) Coult.
Alliaria petiolata (M. Bieb.) Cavara & Grande
Berteroa incana (L.) DC.
Aegilops cylindrica Host

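As you can see, these strings are all formatted very differently (some letters are capitalized and some aren't, and there are sometimes brackets, ampersands, periods, etc.).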

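My question is: is there any way to dynamically read the formatting of each string in the lookup table, so that I can compare it to the value the data entry person entered and make sure it is formatted properly? In the script below, the first elif tests whether the value is in the lookup table at all, by upper-casing everything so that the match works regardless of formatting. The second elif sort of tests the formatting by comparing the value against the lookup table exactly as entered. This returns the records whose formatting doesn't match, but it doesn't tell you specifically why a record failed.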

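What I would like to do is read the string values in the lookup table and somehow dynamically read the formatting of each one, so that I can identify the specific error (e.g. a letter that should be capitalized but isn't).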

So far, my code snippet looks like this:

        # Determine if the field heading is in a list I built earlier
        if "SCIENTIFIC_NAME" in fieldnames:
            # First, Test to see if record is empty
            if not row.SCIENTIFIC_NAME:
                weedPLineErrors.append("SCIENTIFIC_NAME record is empty")
            # Second, Test to see if value is in the lookup table, regardless of formatting.
            elif row.SCIENTIFIC_NAME.upper() not in [x.upper() for x in weedScientificTableList]:
                weedPLineErrors.append("SCIENTIFIC_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not in the domain table")
            # Third, if the second test is satisfied, we know the value is in the lookup table. We can then test the lookup table again, without capitalizing everything to see if there is an exact match to account for formatting.
            elif row.SCIENTIFIC_NAME not in weedScientificTableList:
                weedPLineErrors.append("SCIENTIFIC_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not formatted properly")
            else:
                pass

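I hope my question is clear enough. I looked at string templates, but I don't think they do what I want... at least not dynamically. If anyone can point me in a better direction, I'm all ears... but maybe I'm out to lunch on this one.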

Thanks, Mike

Answer 1

To deal with the punctuation, you can use regular expressions.

>>> import re
>>> def tokenize(s):
...     return re.split('[^A-Za-z]+', s) # Split by anything that isn't a letter
...
>>> tokens = tokenize('Alliaria petiolata (M. Bieb.) Cavara & Grande')
>>> tokens
['Alliaria', 'petiolata', 'M', 'Bieb', 'Cavara', 'Grande']

To deal with the capitalization, you can use

>>> tokens = [s.lower() for s in tokens]

From there, you can rewrite the entry in a standard format, for example

>>> ## I'm not sure exactly what format you're looking for
>>> first, second, third = [s.capitalize() for s in tokens[:3]]
>>> "%s %s (%s)" % (first, second, third)
'Alliaria Petiolata (M)'

This probably isn't the exact format you want, but maybe it will get you headed in the right direction.
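The same tokens could also be used to pinpoint capitalization mistakes rather than rewrite the entry. The following is only a sketch of that idea, written for Python 3: `case_errors` is a hypothetical helper, not part of the asker's script, and it assumes the entered name and the lookup-table name have the same number of tokens in the same order.

```python
import re

def tokenize(s):
    # Split on runs of non-letters and drop the empty strings that
    # re.split leaves behind when the name starts or ends with punctuation.
    return [t for t in re.split(r'[^A-Za-z]+', s) if t]

def case_errors(entered, reference):
    """Return (entered_token, expected_token) pairs that differ only in case."""
    pairs = zip(tokenize(entered), tokenize(reference))
    return [(e, r) for e, r in pairs if e != r and e.lower() == r.lower()]

print(case_errors('sonchus arvensis l.', 'Sonchus arvensis L.'))
# [('sonchus', 'Sonchus'), ('l', 'L')]
```

Because the punctuation is thrown away during tokenizing, this only reports case problems; missing brackets or periods would need a separate check.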

Answer 2

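You could build a dictionary of the names from the lookup table. Assuming the names are stored in a list (call it correctList), you can write a function that removes all formatting, perhaps converts the result to lower or upper case, and stores it in a dictionary. For example, here is some sample code to build the dictionary: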


def removeFormatting(name):
    name = name.replace("(", "").replace(")", "")
    name = name.replace(".", "")
    ...
    return name.lower()

formattingDict = dict([(removeFormatting(i), i) for i in correctList])

Now you can compare the strings entered by the data entry staff. Let's say they are in a list called inputList.


for name in inputList:
    unformattedName = removeFormatting(name)
    lookedUpName = formattingDict.get(unformattedName, "")
    if not lookedUpName:
        print("Spelling mistake: " + name)
    elif lookedUpName != name:
        print("Formatting error")
        print(differences(name, lookedUpName))
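For reference, here is one way the lookup could be put together end to end. This is a minimal sketch for Python 3 under stated assumptions: `strip_formatting`, `canonical` and `check` are hypothetical names, the sample list is just three of the names from the question, and "formatting" is assumed to mean everything except the letters themselves.

```python
import re

def strip_formatting(name):
    # Assumption: only the letters identify a name, so case, spaces
    # and punctuation are all discarded before the lookup.
    return re.sub(r'[^a-z]', '', name.lower())

correct_list = [
    'Sonchus arvensis L.',
    'Linaria dalmatica (L.) Mill.',
    'Aegilops cylindrica Host',
]

# Map the stripped form of each name back to its correctly formatted form.
canonical = {strip_formatting(n): n for n in correct_list}

def check(name):
    looked_up = canonical.get(strip_formatting(name))
    if looked_up is None:
        return 'not in lookup table'
    if looked_up != name:
        return 'formatting error, expected: ' + looked_up
    return 'ok'

print(check('linaria dalmatica L. Mill'))
print(check('Sonchus arvenis L.'))
```

Stripping spaces as well as punctuation makes the match very forgiving, but two distinct names could in principle collapse to the same key, so it may be worth checking that `len(canonical) == len(correct_list)` after building the dictionary.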

The differences function could be filled in with rules for things like brackets, periods, and so on.


def differences(inputName, lookedUpName):
    mismatches = []
    # Check for brackets
    if "(" in lookedUpName:
        if "(" not in inputName:
            mismatches.append("Bracket missing")
    ...
    # Add more rules
    return mismatches
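As an alternative to hand-written rules, the standard library's difflib can report where two strings diverge, which gets close to "reading the format dynamically". The sketch below is written for Python 3 and is only one possible approach; `describe_differences` is a hypothetical name and the message wording is made up for illustration.

```python
import difflib

def describe_differences(entered, expected):
    """List the character-level edits that turn `entered` into `expected`."""
    matcher = difflib.SequenceMatcher(None, entered, expected, autojunk=False)
    messages = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        got, want = entered[i1:i2], expected[j1:j2]
        if tag == 'replace' and got.lower() == want.lower():
            messages.append('wrong case at position %d: %r should be %r' % (i1, got, want))
        elif tag == 'replace':
            messages.append('at position %d: %r should be %r' % (i1, got, want))
        elif tag == 'insert':
            messages.append('missing %r at position %d' % (want, i1))
        else:  # tag == 'delete'
            messages.append('unexpected %r at position %d' % (got, i1))
    return messages

print(describe_differences('sonchus arvensis L.', 'Sonchus arvensis L.'))
# ["wrong case at position 0: 's' should be 'S'"]
```

Positions refer to the entered string, so they can be shown to the data entry person directly. For names that differ a lot, the opcodes can get noisy, so this is probably most useful once the dictionary lookup has already confirmed which name was meant.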

Does this answer your question?

Dynamically read the format of a string, Python
