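I am trying to clean up some data and am struggling to do it in Python/pandas. I have a series of TV program titles, and I would like to do the following:
Here is my input: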
Brooklyn 99 103
Hit The Floor 110
Output:
Brooklyn 99
Hit The Floor
As a separate function (or functions), I would like to remove any other season/episode format and any string that follows it:
Input:
Hot in Cleveland s6 ep03
Mutt & Stuff #111
LHH ATL 08/31a HD
LHH ATL 04/04 Check
Esther With Hot Chicks Ep. 1
Suspect 2/24
Suspect 2/24 HD
Output:
Hot in Cleveland
Mutt & Stuff
LHH ATL
LHH ATL
Esther With Hot Chicks
Suspect
Suspect
I wrote these functions:
def digit(value):
    return value.isdigit()

def another(value):
    li = value.split(" ")
    x = len(filter(digit, value))
    ind = li.index( str(filter(digit, li)[0]) )
    try:
        if x > 1:
            return " ".join(li[:ind+1])
        else:
            value.str.replace(r'(\D+).*', r'\1').str.replace(r'\s+.$', '').str.strip()
    except:
        return value.str.replace(r'(\D+).*', r'\1').str.replace(r'\s+.$', '').str.strip()

data["LongTitleAdjusted"] = data["Long Title"].apply(another)
data["LongTitleAdjusted"]
But I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-49-3526b96a8f5a> in <module>()
15 return value.str.replace(r'(\D+).*', r'\1').str.replace(r'\s+.$', '').str.strip()
16
---> 17 data["LongTitleAdjusted"] = data["Long Title"].apply(another)
18 data["LongTitleAdjusted"]
C:\Users\lehmank\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
2167 values = lib.map_infer(values, lib.Timestamp)
2168
-> 2169 mapped = lib.map_infer(values, f, convert=convert_dtype)
2170 if len(mapped) and isinstance(mapped[0], Series):
2171 from pandas.core.frame import DataFrame
pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:62578)()
<ipython-input-49-3526b96a8f5a> in another(value)
13 value.str.replace(r'(\D+).*', r'\1').str.replace(r'\s+.$', '').str.strip()
14 except:
---> 15 return value.str.replace(r'(\D+).*', r'\1').str.replace(r'\s+.$', '').str.strip()
16
17 data["LongTitleAdjusted"] = data["Long Title"].apply(another)
AttributeError: 'unicode' object has no attribute 'str'
Answer 0 (score: 0)
This handles your sample dataset:
df['title'].str.replace(r'(\D+).*', r'\1').str.replace(r'\s+.$', '').str.strip()
However, it also converts Brooklyn 99 to Brooklyn.
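As for the AttributeError itself: Series.apply calls the function once per element, so inside another() the argument is a plain string, and .str is an accessor that exists only on a Series. One way around both problems is to write per-title functions with the re module and pass those to apply. The following is only a rough sketch under stated assumptions: the names strip_trailing_code, strip_episode_marker and EPISODE_MARKER are made up for illustration, the "keep the first number when there are several" rule is inferred from the question's own code, and the pattern covers only the marker styles that appear in the sample input.

```python
import re

# Hypothetical pattern, built only from the samples in the question:
# "s6", "ep03" / "Ep. 1", "#111", and date-like codes such as "08/31" or "2/24".
# It matches the marker plus everything after it.
EPISODE_MARKER = re.compile(
    r"\s+(?:s\d+\b|ep\.?\s*\d+|#\d+|\d{1,2}/\d{1,2}).*$",
    re.IGNORECASE,
)

def strip_trailing_code(title):
    """Drop a numeric episode code while keeping a number that is part of the title.

    Assumed rule (inferred from the question's code): if several tokens are
    all digits, keep everything up to and including the first one; if only
    one is, cut the title just before it.
    """
    tokens = title.split()
    digit_positions = [i for i, tok in enumerate(tokens) if tok.isdigit()]
    if len(digit_positions) > 1:
        return " ".join(tokens[:digit_positions[0] + 1])
    if len(digit_positions) == 1:
        return " ".join(tokens[:digit_positions[0]])
    return title

def strip_episode_marker(title):
    """Remove the first season/episode marker and any text that follows it."""
    return EPISODE_MARKER.sub("", title).strip()

for sample in ["Brooklyn 99 103", "Hit The Floor 110"]:
    print(strip_trailing_code(sample))

for sample in ["Hot in Cleveland s6 ep03", "Mutt & Stuff #111", "LHH ATL 08/31a HD"]:
    print(strip_episode_marker(sample))
```

Because each function takes a single string, it should work with something like data["Long Title"].apply(strip_trailing_code) without touching .str at all. Titles that use other marker styles would need the pattern extended.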