Question

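I have a pandas DataFrame with a column named 'title' that holds string values such as "Robert Hall 2015 Viognier" and "Woodinville Wine Cellars 2012 Reserve". I am trying to iterate over each row and extract the year as an integer, but the strings all differ from one another and the year is not always in the same position.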

Answer 1

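You can use the str.extract method with a regular expression. Note that str.extract requires the pattern to contain a capture group, and that astype(int) will raise an error if any title has no four-digit match:

df['title'].str.extract(r'(\d{4})', expand=False).astype(int)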

Here is a crash course on regular expressions (for a summary, see the "Course Notes" on the right).
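As a minimal sketch of the str.extract approach on a small frame: the first two titles come from the question, while the third title (with no year) and the new 'year' column name are assumptions added for illustration. The nullable "Int64" dtype is one way to keep rows without a year as missing values instead of raising an error.

```python
import pandas as pd

# Sample data: the first two titles are from the question;
# the third is a hypothetical title that contains no year.
df = pd.DataFrame({
    "title": [
        "Robert Hall 2015 Viognier",
        "Woodinville Wine Cellars 2012 Reserve",
        "Nonvintage Sparkling Brut",
    ]
})

# str.extract needs a capture group; expand=False returns a Series.
years = df["title"].str.extract(r"(\d{4})", expand=False)

# Convert to numbers, then to the nullable integer dtype so that
# titles without a match become <NA> rather than causing an error.
df["year"] = pd.to_numeric(years).astype("Int64")

print(df)
```

If every title is guaranteed to contain a year, a plain astype(int) on the extracted Series should be enough.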

Answer 2

Please post your code. A hint:

import re

mystring = "Woodinville Wine Cellars 2012 Reserve"

# Search for the first run of four consecutive digits
match = re.search(r'\d{4}', mystring)
print(match.group(0))
# prints: 2012

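This works for any string that contains a four-digit year; it returns the first run of four consecutive digits. If the string contains no such run, re.search returns None, so check match before calling group.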
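One caveat is that \d{4} also matches the first four digits of a longer number. The sketch below is one possible way to tighten the pattern, assuming a year should be exactly four digits with no digit on either side; the function name find_year and the "Lot 123456 Red" test string are hypothetical.

```python
import re

# Four digits that are neither preceded nor followed by another digit.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def find_year(text):
    """Return the first standalone four-digit number in text as an int, or None."""
    match = YEAR_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1))


print(find_year("Woodinville Wine Cellars 2012 Reserve"))
print(find_year("Lot 123456 Red"))
```

Whether this stricter rule is appropriate depends on what the titles in the real data look like.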

Answer 3

You can use a regular expression to check whether the string contains four consecutive digits, and use match to extract them.

/**
 * Get a year from the given title.
 * @param {string} title The title to extract the year from.
 * @returns {number|undefined} The extracted year. If undefined is returned, a year could not be found.
 */
function getYearFromTitle (title)
{
    // Make sure that the title is a string
    if (typeof title !== "string") throw new Error("Typeof title must be a string!");

    // Do a regular expression search for 4 digits
    const results = title.match(/\d{4}/);

    // If results is null, return undefined.
    if (!results) return;

    // Return the first occurrence of 4 digits as a number.
    return Number(results[0]);
}

Note: this is JavaScript code; you will have to write the equivalent code in Python.
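A rough Python equivalent of the JavaScript function might look like the sketch below. The name get_year_from_title is simply a Python-style rendering of the original name, and returning None in place of undefined is an assumption about how the port should behave.

```python
import re


def get_year_from_title(title):
    """Return the first run of four digits in title as an int, or None if absent."""
    # Make sure that the title is a string
    if not isinstance(title, str):
        raise TypeError("title must be a string")

    # Search for four consecutive digits anywhere in the title
    result = re.search(r"\d{4}", title)

    # No match: return None (the counterpart of undefined in the JavaScript version)
    if result is None:
        return None

    # Return the first occurrence of four digits as an integer
    return int(result.group(0))


print(get_year_from_title("Robert Hall 2015 Viognier"))
```

With pandas, such a function could be applied to each row with df['title'].apply(get_year_from_title), although the vectorized str.extract method is usually the more idiomatic choice.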

Extracting the "year" from a list of strings in a pandas DataFrame

3 answers: