Question

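I have a dataframe with a column data1 that holds free-form information, and I would like to add a column data2 containing only the name found in data1: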

       data1                                         data2
0      info  name: Michael Jackson      New York     Michael Jackson
1      info 12 name: Michael Jordan III Los Angeles  Michael Jordan III

Do you know how I can do this?

Answer 1

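Without a clear delimiter this is non-trivial: the names themselves contain spaces, they vary in length (2 words, 3 words), and the trailing column may also consist of several words separated by spaces.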

Splitting the string on ': ' gets you a partial solution:

df['data2'] = df['data1'].str.split(': ').str[-1]

>>> print(df)

                                          data1                           data2
0     info  name: Michael Jackson      New York   Michael Jackson      New York
1  info 12 name: Michael Jordan III Los Angeles  Michael Jordan III Los Angeles
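To reproduce the step above, here is a minimal, self-contained sketch. The sample strings are rebuilt from the question, so their exact spacing is an assumption.

```python
import pandas as pd

# Sample data rebuilt from the question (exact spacing is an assumption)
df = pd.DataFrame({
    'data1': [
        'info  name: Michael Jackson      New York',
        'info 12 name: Michael Jordan III Los Angeles',
    ]
})

# Split on ': ' and keep the last piece, i.e. everything after 'name: '
df['data2'] = df['data1'].str.split(': ').str[-1]

print(df['data2'].tolist())
```

The city is still attached to each name at this point, which is why the split alone is only a partial solution.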

If you have a list of the "cities", a complete solution becomes possible:

import re

def replace(string, substitutions):
    """Replaces multiple substrings in a string."""
    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

# List of cities to remove from strings
cities = ['New York', 'Los Angeles']
# Dictionary matching each city with the empty string
substitutions = {city:'' for city in cities}

# Splitting to create new column as above
df['data2'] = df['data1'].str.split(': ').str[-1]
# Applying replacements to new column
df['data2'] = df['data2'].map(lambda x: replace(x, substitutions).strip())

>>> print(df)

                                          data1               data2
0     info  name: Michael Jackson      New York     Michael Jackson
1  info 12 name: Michael Jordan III Los Angeles  Michael Jordan III

This uses the replace function from carlsmith.
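As a possible alternative (not part of the original answer), the split and the city removal could be combined into a single regular expression passed to `Series.str.extract`. This is a sketch that assumes every row contains 'name:' and ends with one of the known cities; the sample data and the names `city_pattern` and `pattern` are illustrative.

```python
import re
import pandas as pd

# Sample data rebuilt from the question (exact spacing is an assumption)
df = pd.DataFrame({
    'data1': [
        'info  name: Michael Jackson      New York',
        'info 12 name: Michael Jordan III Los Angeles',
    ]
})

# Known trailing values to strip; this list must be supplied by you
cities = ['New York', 'Los Angeles']

# Longest city first, so a longer name is tried before any shorter one it contains
city_pattern = '|'.join(map(re.escape, sorted(cities, key=len, reverse=True)))

# Capture the text between 'name:' and the trailing city, ignoring surrounding whitespace
pattern = rf'name:\s*(.*?)\s*(?:{city_pattern})\s*$'

# expand=False returns a Series because the pattern has a single capture group
df['data2'] = df['data1'].str.extract(pattern, expand=False)

print(df['data2'].tolist())
```

Rows that do not match the pattern (for example, a city missing from the list) come back as NaN, which makes unhandled cases easy to spot.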
