Convert strings enclosed in %% to lowercase in Python

Time: 2017-11-29 19:24:57

Tags: python regex pyspark-sql

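I have a PySpark dataframe in which one field contains values enclosed in %% .. %%. The enclosed content is in mixed case, and I want to convert it to lowercase.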

A snapshot of the dataframe was attached as an image.

The text in the column looks like this:

https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.Trip_Intrip_1_dest_City_1%%

https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_Offers_mod_Images.LargeImageURL%%

I want to convert the text above into the following format:

https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.trip_intrip_1_dest_city_1%%

https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_offers_mod_images.largeimageurl%%

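Only the strings enclosed in %% should be converted to lowercase.

3 answers:

Answer 0: (score: 2)

Because strings are immutable in Python, you have to build a new value and reassign it. So I think it is simplest to iterate over the string character by character (since you said in the comments that you want to avoid split). I was thinking of something like this: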

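textstr = 'http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.Trip_Intrip_1_dest_City_1%%'  # sample input from the question
new = ''
f = 0  # number of '%' characters seen so far
for i in textstr:
    if i == '%':
        f += 1
    # Integer division: f // 2 is odd from the end of an opening %% until the closing %% completes
    if (f // 2) % 2 == 1:
        new += i.lower()
    else:
        new += i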
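
The loop above counts every single % character, so a URL that also contains percent-escapes such as %20 can throw the count off. As a hedged alternative that still avoids split and regular expressions, the sketch below searches for the two-character delimiter with str.find. The function name lower_between_markers and its marker parameter are hypothetical, chosen for illustration only.

```python
def lower_between_markers(text, marker='%%'):
    """Lowercase the text between paired markers, without split or regex."""
    parts = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break  # no further opening marker
        end = text.find(marker, start + len(marker))
        if end == -1:
            break  # unpaired opening marker: leave the rest unchanged
        end += len(marker)
        parts.append(text[pos:start])          # text outside the markers, as-is
        parts.append(text[start:end].lower())  # markers plus enclosed text
        pos = end
    parts.append(text[pos:])  # whatever follows the last complete pair
    return ''.join(parts)


print(lower_between_markers('http://%%FOO%%|some_string%%BAR%%'))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which matters only for long inputs.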

Alternatively, use a regular expression.
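
A minimal sketch of what the regular-expression route could look like, assuming placeholders never nest and always come in %%...%% pairs. The names PLACEHOLDER and lower_placeholders are hypothetical; re.sub accepting a callable as the replacement is standard library behaviour.

```python
import re

# Non-greedy match of one %%...%% placeholder, delimiters included.
PLACEHOLDER = re.compile(r'%%.*?%%')


def lower_placeholders(text):
    # The callable receives each match object and returns its lowercase
    # form, so text outside the delimiters is left untouched.
    return PLACEHOLDER.sub(lambda m: m.group(0).lower(), text)


print(lower_placeholders('http://X.co/Search?location=%%t.Trip_City_1%%'))
```

Because the replacement is computed per match, several placeholders in one string are each lowercased independently.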

Answer 1: (score: 2)

You can use a simple regular expression:

  • Find all the sequences to replace
  • Replace each sequence with its lowercase equivalent

import re

link1 = 'https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_Offers_mod_Images.LargeImageURL%%'
link2 = 'https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.Trip_Intrip_1_dest_City_1%%'
links = [link1, link2]

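for idx, link in enumerate(links):
    # Find every %%...%% sequence (non-greedy, delimiters included)
    lowers = re.findall(r'%%.*?%%', link)
    for x in lowers:
        # Replace only this literal sequence and accumulate into links[idx],
        # so links with several placeholders are handled correctly
        links[idx] = links[idx].replace(x, x.lower())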

for link in links:
    print(link)

Output:

https://images.trvl-media.com/media/content/expus/email/2016/us/banner/images/image_stor-34461_09_600x250.jpg|%%mis_lx_offers_mod_images.largeimageurl%%
https://www.xxxxxxxx.co.nz/Activities|http://www.xxxxxxxx.co.nz/things-to-do/search?location=%%t.trip_intrip_1_dest_city_1%%

Answer 2: (score: 0)

Using the regular expression suggested by @mentalita:

input_df:

>>> df.show(truncate=False)
+----+---------------------------------+
|col1|col2                             |
+----+---------------------------------+
|1   |http://%%FOO%%|some_string%%BAR%%|
|2   |http://%%FOO%%|some_string       |
+----+---------------------------------+

Code:

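import re
from pyspark.sql import functions as F

def convert_to_lower(link):
    # Pass nulls through unchanged; re.findall would raise on None
    if link is None:
        return None
    target_strings = re.findall(r'%%.*?%%', link)
    for x in target_strings:
        # Literal replacement, so regex metacharacters in x are not interpreted
        link = link.replace(x, x.lower())
    return link

# F.udf returns a string column by default
convert_to_lower_udf = F.udf(convert_to_lower)
df = df.withColumn('converted_strings', convert_to_lower_udf('col2'))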

output_df:

>>> df.show(truncate=False)
+----+---------------------------------+---------------------------------+
|col1|col2                             |converted_strings                |
+----+---------------------------------+---------------------------------+
|1   |http://%%FOO%%|some_string%%BAR%%|http://%%foo%%|some_string%%bar%%|
|2   |http://%%FOO%%|some_string       |http://%%foo%%|some_string       |
+----+---------------------------------+---------------------------------+