For example, take this dataset, which is a CSV file:
Name , Country, Income
Alan Turing, UK, 1000
James Clark, US, 5000
I want to apply some transformations to "Country" and "Income", but display "Name" as
Name
A Turing
J Clark
Answer 0 (score: 0)
Since you tagged the question with Python and asked about dataframes, you can use the pandas str.replace method:
import pandas as pd
data = [['Alan Turing', 'UK', 1000],
['James Clark', 'US', 5000]]
df = pd.DataFrame(data=data, columns=['Name', 'Country', 'Income'])
# Use a raw string for the pattern so that \w is not treated as a string escape
df['Name'] = df.Name.str.replace(r'(\w)\w* (\w+)', r'\1 \2', regex=True)
print(df)
Output:
Name Country Income
0 A Turing UK 1000
1 J Clark US 5000
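The pattern (\w)\w* (\w+) is a regular expression that captures the first letter of the first name (group 1) and the entire last name (group 2). The replacement string r'\1 \2' then rewrites each match as the first initial, a space, and the last name.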
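To see what the two capture groups do on their own, here is a minimal sketch using Python's built-in re module instead of pandas. The helper name abbreviate is hypothetical and is used only for illustration; the pattern and replacement are the ones from the answer.

```python
import re

# Same pattern as in the answer: group 1 is the first letter of the first
# word, group 2 is the word that follows the space.
PATTERN = re.compile(r'(\w)\w* (\w+)')


def abbreviate(name):
    """Replace the first word of `name` with its initial (hypothetical helper)."""
    return PATTERN.sub(r'\1 \2', name)


for name in ['Alan Turing', 'James Clark']:
    print(abbreviate(name))
```

For the two sample names this prints A Turing and J Clark, matching the pandas output above.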
Answer 1 (score: 0)
from pyspark.sql.functions import split,concat,lit
myValues = [('Alan Turing','UK',1000),('James Clark','US',5000)]
df = sqlContext.createDataFrame(myValues,['Name','Country','Income'])
df.show()
+-----------+-------+------+
| Name|Country|Income|
+-----------+-------+------+
|Alan Turing| UK| 1000|
|James Clark| US| 5000|
+-----------+-------+------+
df = df.withColumn('Name', concat(split(df['Name'], ' ')[0].substr(0,1), lit(' '), split(df['Name'], ' ')[1]))
df.show()
+--------+-------+------+
| Name|Country|Income|
+--------+-------+------+
|A Turing| UK| 1000|
| J Clark| US| 5000|
+--------+-------+------+
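If the name is Alan Turing Müller, the code above does not give the desired result: it keeps only the first two words, so "Müller" is dropped. The following code is more robust: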
from pyspark.sql.functions import concat, instr, length
myValues = [('Alan Turing Müller','UK',1000),('James Clark','US',5000)]
df = sqlContext.createDataFrame(myValues,['Name','Country','Income'])
df.show()
+------------------+-------+------+
| Name|Country|Income|
+------------------+-------+------+
|Alan Turing Müller| UK| 1000|
| James Clark| US| 5000|
+------------------+-------+------+
df = df.withColumn('Name', concat(df['Name'].substr(0,1),df['Name'].substr(instr(df['Name'],' '),length(df['Name'])-instr(df['Name'],' ')+1)))
df.show()
+---------------+-------+------+
| Name|Country|Income|
+---------------+-------+------+
|A Turing Müller| UK| 1000|
| J Clark| US| 5000|
+---------------+-------+------+
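The substring arithmetic in the last snippet can be checked without a Spark session. The following is a minimal plain-Python sketch that mirrors the 1-based positions used by instr and substr. The function name initial_plus_rest is hypothetical, and returning the name unchanged when it contains no space is an assumption, since the PySpark snippet does not treat that case specially.

```python
def initial_plus_rest(name):
    """Keep the first character plus everything from the first space onward."""
    # Mirrors instr(name, ' '): 1-based position of the first space, 0 if absent.
    pos = name.find(' ') + 1
    if pos == 0:
        # Assumption: a name without a space is returned unchanged.
        return name
    first = name[0]                      # first character of the name
    rest_len = len(name) - pos + 1       # length(name) - instr(name, ' ') + 1
    rest = name[pos - 1:pos - 1 + rest_len]  # like substr(pos, rest_len)
    return first + rest


for name in ['Alan Turing Müller', 'James Clark']:
    print(initial_plus_rest(name))
```

Because the remainder starts at the first space and runs to the end of the string, any number of trailing words is preserved.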