Applying a lambda to recode (tricky) strings to numbers

Time: 2017-07-14 06:02:37

Tags: string list python-3.x lambda

I have a large data set of NFL scenarios, but for the sake of illustration, let me reduce it to a list of 2 observations. Like so:

data = [[scenario1],[scenario2]]

Here is what the data set contains:

data[0][0]
>>"It is second down and 3. The ball is on your opponent's 5 yardline. There is 3 seconds left in the fourth quarter. You are down by 3 points."

data[1][0]
>>"It is first down and 10. The ball is on your 20 yardline. There is 7 minutes left in the third quarter. You are down by 10 points."

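I can't build any models with the data in string format like this. So I want to recode these scenarios into new columns (or feature values, if you will) as quantitative values. I figured I should set up the data frame first: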

down = 0
yards = 0
yardline = 0
seconds = 0
quarter = 0
points = 0

data = [[scenario1, down, yards, yardline, seconds, quarter, points], [scenario2, down, yards, yardline, seconds, quarter, points]]

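Now for the tricky part, where I have to populate the new columns from the information in the scenario column. It is tricky because, for example, if the word "opponent's" is present in the second sentence, the yardline has to be computed as 100 minus whatever the yardline number is. In the scenario1 variable above, it should be 100 - 5 = 95.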
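The yardline rule described above can be sketched in isolation. The helper name `field_position` and the regular expression are assumptions made for illustration, and the sketch assumes the sentence is always phrased the way the two sample scenarios are:

```python
import re

def field_position(sentence):
    """Return the yardline counted from your own goal line, or None if absent."""
    # Group 1 is only set when the ball is on the opponent's side of the field.
    match = re.search(r"(opponent's )?(\d+) yardline", sentence)
    if match is None:
        return None
    yards = int(match.group(2))
    # On the opponent's side, convert to distance from your own goal line.
    return 100 - yards if match.group(1) else yards

print(field_position("The ball is on your opponent's 5 yardline."))  # 95
print(field_position("The ball is on your 20 yardline."))            # 20
```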

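At first I thought I should separate out all the numbers and throw away the words, but as pointed out above, some of the words are actually needed to assign the quantitative values correctly. I have never written a lambda for anything this nuanced. Perhaps a lambda is not the right approach? I am open to any and all suggestions.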

To reinforce the point, this is what I would like to see for scenario1 if I enter:

data[0][1:]
>>2,3,95,3,4,-3

Thanks

1 Answer:

Answer 0: (score: 1)

A lambda is not the way to go here. Python's re module is your friend :)

from re import search

def getScenarioData(scenario):
    data = []

    ordinals_to_nums = {'first':1, 'second':2, 'third':3, 'fourth':4}
    numerals_to_nums = {
        'zero':0, 'one':1, 'two':2, 'three':3, 'four':4,
        'five':5, 'six':6, 'seven':7, 'eight':8, 'nine':9
    }

    # Downs
    match = search('(first|second|third|fourth) down and', scenario)
    if match:
        raw_downs = match.group(1)
        downs = ordinals_to_nums[raw_downs]
        data.append(downs)

    # Yards
    match = search(r'down and (\S+)\.', scenario)
    if match:
        raw_yards = match.group(1)
        data.append(int(raw_yards))

    # Yardline
    match = search(r"(opponent's)? (\S+) yardline", scenario)
    if match:
        raw_yardline = match.groups()
        yardline = 100-int(raw_yardline[1]) if raw_yardline[0] else int(raw_yardline[1])
        data.append(yardline)

    # Seconds
    match = search(r'(\S+) (seconds|minutes) left', scenario)
    if match:
        raw_secs = match.groups()
        multiplier = 1 if raw_secs[1] == 'seconds' else 60
        data.append(int(raw_secs[0]) * multiplier)

    # Quarter
    match = search(r'(\S+) quarter', scenario)
    if match:
        raw_quarter = match.group(1)
        quarter = ordinals_to_nums[raw_quarter]
        data.append(quarter)

    # Points
    match = search(r'(up|down) by (\S+) points', scenario)
    if match:
        raw_points = match.groups()
        polarity = 1 if raw_points[0] == 'up' else -1
        points = int(raw_points[1]) * polarity
    else:
        # No point difference found in the text; default to 0
        points = 0
    data.append(points)

    return data

Personally, I find it a bit odd to store the data as [[scenario, <scenario_data>], ...], but to add the data to each scenario:

for s in data:
    s.extend(getScenarioData(s[0]))

I would suggest using a list of dictionaries, because indices like data[0][3] can get confusing a month or two from now:

def getScenarioData(scenario):
    # instead of data = []
    data = {'scenario':scenario}

    # instead of data.append(downs)
    data['downs'] = downs

    ...

scenarios = ['...', '...']
data = [getScenarioData(s) for s in scenarios]
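As a rough, self-contained sketch of the dictionary-based variant outlined above: the function name `get_scenario_dict`, the key names, and the tightened patterns (`\d+` instead of `\S+`) are assumptions for illustration, and the sketch only covers phrasing like the two sample scenarios from the question.

```python
import re

ORDINALS = {'first': 1, 'second': 2, 'third': 3, 'fourth': 4}

def get_scenario_dict(scenario):
    """Parse one scenario string into a dict of numeric features."""
    data = {'scenario': scenario}

    match = re.search(r'(first|second|third|fourth) down and (\d+)\.', scenario)
    if match:
        data['downs'] = ORDINALS[match.group(1)]
        data['yards'] = int(match.group(2))

    match = re.search(r"(opponent's )?(\d+) yardline", scenario)
    if match:
        yards = int(match.group(2))
        data['yardline'] = 100 - yards if match.group(1) else yards

    match = re.search(r'(\d+) (seconds|minutes) left', scenario)
    if match:
        multiplier = 60 if match.group(2) == 'minutes' else 1
        data['seconds'] = int(match.group(1)) * multiplier

    match = re.search(r'(first|second|third|fourth) quarter', scenario)
    if match:
        data['quarter'] = ORDINALS[match.group(1)]

    match = re.search(r'(up|down) by (\d+) points', scenario)
    if match:
        sign = 1 if match.group(1) == 'up' else -1
        data['points'] = sign * int(match.group(2))

    return data

scenarios = [
    "It is second down and 3. The ball is on your opponent's 5 yardline. "
    "There is 3 seconds left in the fourth quarter. You are down by 3 points.",
    "It is first down and 10. The ball is on your 20 yardline. "
    "There is 7 minutes left in the third quarter. You are down by 10 points.",
]
data = [get_scenario_dict(s) for s in scenarios]
print(data[0]['yardline'], data[1]['seconds'])
```

With named keys, a lookup such as data[0]['yardline'] stays readable without having to remember column positions.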

EDIT: If you want to get a value out of a dictionary, use the get method to avoid raising a KeyError, because get returns None by default if the key is not found:

for s in data:
    print(s.get('quarter'))
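The built-in dict.get method also accepts a second argument that is returned in place of None when the key is missing. A small illustration, where the `record` dictionary is a made-up example rather than real parsed output:

```python
# Hypothetical parsed record that happens to lack a 'quarter' key
record = {'scenario': '...', 'downs': 2, 'yards': 3}

print(record.get('quarter'))     # None, because the key is missing
print(record.get('quarter', 0))  # 0, the supplied fallback value
print(record.get('downs'))       # 2, the key exists
```

Whether a fallback such as 0 is appropriate depends on the model; None makes missing values easier to spot later.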