Extract and replace strings between characters

Time: 2017-06-24 06:14:33

Tags: python regex string pandas replace

I have cleaned data that was encoded in 'utf-8'. Using .str.extract(), I got the text between the [(u'text')] format, but my code does not catch junk/Unicode escape sequences such as "\u09xx" and similar. How can I remove them?

Input:

{"HT" : ["([u'SoccerTips', u'FootballTips'],)", "([u'\u092b\u094c\u091c\u0940', u'FixedMatch', u'CT2017Final'],)"]}

My code:

df1 = df.drop('HT', axis=1).join(
             df.HT
             .str
             .split(expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('HT')           
             )

df1['HT'] = df1['HT'].str.extract("u+(\'[^\']*)", expand=False).fillna('')
df1['HT'] = "#" + df1['HT']

Expected output:

{"HT" : ["#'SoccerTips" , "#'FootballTips", "#'\u092b\u094c\u091c\u0940", "#'FixedMatch", "#'CT2017Final"]}

1 answer:

Answer 0 (score: 0)

A possible solution:

import pandas as pd

# the input
df1= {"HT" : ["([u'SoccerTips', u'FootballTips'],)", "([u'\u092b\u094c\u091c\u0940', u'FixedMatch', u'CT2017Final'],)"]}

# convert to Dataframe
df1= pd.DataFrame(df1)

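# cleaning: assign each result back instead of calling inplace=True on df1.HT,
# and use raw strings so the regex escapes reach the regex engine unchanged
df1["HT"] = df1["HT"].replace(r"\(\[|\],\)", "", regex=True)        # strip the "([" and "],)" wrappers
df1["HT"] = df1["HT"].replace(r"u'[^\x00-\x7f]*'", "", regex=True)  # blank out items made only of non-ASCII characters
df1["HT"] = df1["HT"].replace(r"u'([^']+)'", r"#\1", regex=True)    # turn u'Tag' into #Tag
df1["HT"] = df1["HT"].str.split(", ")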

# final result
df1= {'HT':[j for i in df1.HT for j in i]}

# output: df1 -> {'HT': ['#SoccerTips', '#FootballTips', '', '#FixedMatch', '#CT2017Final']}
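
The result above still contains an empty string where the non-ASCII item used to be. If that placeholder is unwanted, one alternative is to extract the items with `re.findall` and filter them before flattening. This is only a sketch under a few assumptions: Python 3.7+ (for `str.isascii`), pandas 0.25+ (for `Series.explode`), and the same sample input as above. The names `ITEM` and `ascii_hashtags` are made up for this illustration and are not part of pandas.

```python
import re

import pandas as pd

# Same sample input as above. In Python 3 the \u escapes are real
# Devanagari characters, not literal backslash sequences.
data = {"HT": ["([u'SoccerTips', u'FootballTips'],)",
               "([u'\u092b\u094c\u091c\u0940', u'FixedMatch', u'CT2017Final'],)"]}
df = pd.DataFrame(data)

# Hypothetical pattern: capture whatever sits between u' and the next '.
ITEM = re.compile(r"u'([^']*)'")


def ascii_hashtags(cell):
    """Return the non-empty, pure-ASCII items of one cell, each prefixed with '#'."""
    return ["#" + tag for tag in ITEM.findall(cell) if tag and tag.isascii()]


# One list per row, then one row per hashtag; dropna() removes rows
# whose list was empty (explode turns an empty list into NaN).
tags = df["HT"].map(ascii_hashtags).explode().dropna().reset_index(drop=True)

result = {"HT": tags.tolist()}
print(result)
# {'HT': ['#SoccerTips', '#FootballTips', '#FixedMatch', '#CT2017Final']}
```

Filtering inside the helper keeps the decision about what counts as junk in one place, so the `tag.isascii()` test can be swapped for a different rule without touching the pandas calls.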