Question

I have a dataframe df with a "Names" column, as shown below:

Names
AL GHAITHA & AL MOOSA
AL ASEEL ELECTRONICS T
SUNRISE SUPERMARKET-QU
EMARAT-AL SAFIYAH(6735
LULU CENTRE LLC EFT TE
THE MAX

Code:

import string

remove_letters = ['AL ', 'THE ']

# my function below :

def remove_start_words(df, col, letters):
    for l in letters:
        for i in df.index:
            x = df.at[i, col]
            if x.startswith(l):
                df.at[i, col] = x[len(l):]
            else:
                df.at[i, col] = x

def remove_strings(df, col):
    for i in df.index:
        x = df.at[i, col]
        x = x.split(' ')
        if len(x) > 1:
            if len(x[1]) > 2:
                x[1] = ''.join(e for e in x[1] if e.isalnum())
                x = ' '.join(x[0:2])
                df.at[i, col] = x
            else:
                df.at[i, col] = x[0]
        else:
            df.at[i, col] = df.at[i, col]


def remove_end_digits(df, col):
    for i in df.index:
        x = df.at[i, col]
        df.at[i, col] = x.rstrip(string.digits)

# calling my function
remove_start_words(df=df, col='Names',
                          letters=remove_letters)

remove_strings(df=df, col='Names')
remove_end_digits(df=df, col='Names')

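The problem is that my dataframe has more than one million values in this column. Is my code poorly optimized? How can I get an optimized solution?

Question 1: I understand that I am using 2 loops, which causes the slowness (one over remove_letters, the other over all the column values).

Is there a better way to check whether a column value starts with any of the prefixes listed in remove_letters and strip it in one go?

Questions 2 and 3: The goal of the function remove_strings is to keep only the first 2 words of each value in the Names column. For example, ASEEL ELECTRONICS T becomes ASEEL ELECTRONICS.

Are there faster versions of remove_strings and remove_end_digits?

Main question: can all three functions be done in one go?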

Expected dataframe:

Names
GHAITHA
ASEEL ELECTRONICS
SUNRISE SUPERMARKET
EMARAT-AL SAFIYAH
LULU CENTRE
MAX

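Note: the function remove_start_words should check whether a name starts with any of the listed prefixes and, if so, remove that prefix. For example, "AL THEMAX" should become "THEMAX", not "MAX" (which would mean removing both AL and THE).

Thanks.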

Answer 1

You can use the following replace approach:

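As one possible shape for a replace-based approach, the sketch below uses pandas' Series.str.replace with a single anchored pattern, so every value is checked against all prefixes in one vectorized call instead of two nested loops. The sample rows come from the question; building the pattern with re.escape is an assumption on my part to keep the prefixes literal.

```python
import re

import pandas as pd

# Sample rows from the question, plus the "AL THEMAX" case from its note.
df = pd.DataFrame({"Names": ["AL GHAITHA & AL MOOSA",
                             "THE MAX",
                             "AL THEMAX",
                             "EMARAT-AL SAFIYAH(6735"]})
remove_letters = ["AL ", "THE "]

# One alternation anchored at the start of the string. The non-capturing
# group makes ^ apply to every prefix, not only to the first one.
pattern = "^(?:" + "|".join(re.escape(p) for p in remove_letters) + ")"

# ^ can only match at position 0, so at most one prefix is removed per value.
df["Names"] = df["Names"].str.replace(pattern, "", regex=True)
print(df)
```

Because the pattern is anchored, "AL" inside "EMARAT-AL SAFIYAH(6735" and the second "AL" in "AL GHAITHA & AL MOOSA" are left alone.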

Answer 2

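Since you said you only want to remove words at the start of the string, you can use a regular expression:

import re

import pandas as pd

file_path = 'file3.xlsx'

df = pd.read_excel(file_path)

words_to_remove = ["THE", "AL"]
# Group the alternatives so that ^ applies to every word, and require
# trailing whitespace so that names such as "ALPHA" are left untouched.
regular_expression = r'^(?:' + '|'.join(map(re.escape, words_to_remove)) + r')\s+'

df.Names = df.Names.apply(lambda x: re.sub(regular_expression, "", x))

In this case the regular_expression variable contains ^(?:THE|AL)\s+, which matches THE or AL followed by whitespace at the start of the string. Without the group, ^THE|AL would anchor only THE and would match AL anywhere in the string.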
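On the main question of doing all three steps in one go, here is a minimal sketch that folds them into a single function applied once per value with Series.map. The names make_cleaner and clean are hypothetical helpers introduced for illustration, and the rules simply mirror the three functions in the question, except that at most one prefix is removed, as its note asks.

```python
import re
import string

import pandas as pd


def make_cleaner(prefixes):
    """Build a per-value cleaner (hypothetical helper, not a pandas API)."""
    # Compile once; anchored so that only a leading prefix can match.
    prefix_re = re.compile("^(?:" + "|".join(re.escape(p) for p in prefixes) + ")")

    def clean(name):
        # Step 1: drop at most one leading prefix.
        name = prefix_re.sub("", name)
        # Step 2: keep the first two words, as remove_strings does.
        words = name.split(" ")
        if len(words) > 1:
            if len(words[1]) > 2:
                second = "".join(ch for ch in words[1] if ch.isalnum())
                name = words[0] + " " + second
            else:
                name = words[0]
        # Step 3: strip trailing digits, as remove_end_digits does.
        return name.rstrip(string.digits)

    return clean


df = pd.DataFrame({"Names": ["AL GHAITHA & AL MOOSA",
                             "AL ASEEL ELECTRONICS T",
                             "EMARAT-AL SAFIYAH(6735",
                             "LULU CENTRE LLC EFT TE",
                             "THE MAX",
                             "AL THEMAX"]})
df["Names"] = df["Names"].map(make_cleaner(["AL ", "THE "]))
print(df)
```

This still runs Python code for each value, but it walks the column once rather than once per prefix plus once per function. Note that under the question's own remove_strings rule, SUNRISE SUPERMARKET-QU would come out as SUNRISE SUPERMARKETQU, so matching the expected SUNRISE SUPERMARKET would need an extra rule that the question does not spell out.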

Answer 3

A few minutes of searching on Google told me


should do the trick.
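For the trailing-digit step in the question, pandas offers a vectorized counterpart of str.rstrip, which avoids the explicit loop over df.index. A minimal sketch, assuming string values like those in the Names column (the sample values here are made up for illustration):

```python
import string

import pandas as pd

# Illustrative values only; they are not taken from the question's data.
names = pd.Series(["SAFIYAH6735", "MAX", "ROOM 101", "2020 VISION"])

# Strip any run of digits from the right-hand end of each value.
cleaned = names.str.rstrip(string.digits)
print(cleaned.tolist())
```

Only digits at the end are removed; leading digits stay, and a space that preceded the digits is kept.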

Remove a provided list of letters from the start of a string
