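Pandas: remove everything after the last "_"

Time: 2019-11-06 21:26:44

Tags: python pandas

I have strings of the following kind in the column below. I want to remove everything after the last _ in each string, and leave the string as it is if it contains no _. (My attempt below would simply exclude strings that have no _.)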

So far I have tried the following, but it only removes everything after the first _:

d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.split('_').str[0]

Here are some sample strings from my SOURCE_NAME column.

Stackoverflow_1234
Stack_Over_Flow_1234
Stackoverflow
Stack_Overflow_1234

Expected:

Stackoverflow
Stack_Over_Flow
Stackoverflow
Stack_Overflow

Any help would be appreciated.

4 answers:

Answer 0 (score: 3)

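Use str.rsplit together with str.get to get the desired result. str.rsplit simply splits a string starting from the end, while str.get retrieves the n-th element of each entry in a pd.Series object.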


Answer

df['SOURCE_NAME'] = df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

The n argument of rsplit limits the number of splits, so with n=1 you keep everything before the last '_'.
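To make the effect of n concrete, here is a minimal sketch using sample values from the question. The variable names (s, first_piece, before_last) are chosen only for illustration. It contrasts the question's split-from-the-left attempt with a single split from the right.

```python
import pandas as pd

s = pd.Series(['Stackoverflow_1234', 'Stack_Over_Flow_1234', 'Stackoverflow'])

# Split from the left and keep the first piece: this cuts at the FIRST '_'
first_piece = s.str.split('_').str[0]

# Split at most once from the right and keep the first piece:
# this cuts at the LAST '_' and leaves strings without '_' unchanged
before_last = s.str.rsplit('_', n=1).str.get(0)

print(first_piece.tolist())   # ['Stackoverflow', 'Stack', 'Stackoverflow']
print(before_last.tolist())   # ['Stackoverflow', 'Stack_Over_Flow', 'Stackoverflow']
```

Passing n as a keyword is the safer spelling, since newer pandas releases no longer accept it positionally.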

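Although the pd.Series.apply solution is roughly twice as fast, I prefer this one because its syntax is more expressive. If you want the (faster) pd.Series.apply solution, see the Timings section.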

pandas documentation


Example

import pandas as pd

strs = ['Stackoverflow_1234',
        'Stack_Over_Flow_1234',
        'Stackoverflow',
        'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})

This gives

print(df)
            SOURCE_NAME
0    Stackoverflow_1234
1  Stack_Over_Flow_1234
2         Stackoverflow
3   Stack_Overflow_1234

Using the suggested solution:

df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)

0      Stackoverflow
1    Stack_Over_Flow
2      Stackoverflow
3     Stack_Overflow
Name: SOURCE_NAME, dtype: object

Timings

Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:

import pandas as pd

df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# increasing the number of rows x 100
df = pd.concat([df] * 100)

%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
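The %timeit lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The helper names and the repeat count below are assumptions made for this example, and the absolute numbers will differ between machines and pandas versions.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})


def with_apply():
    # Plain Python str.rsplit called once per element
    return df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])


def with_str_accessor():
    # pandas string accessor: split once from the right, keep the first piece
    return df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)


# Both approaches should agree before their speed is compared
assert with_apply().tolist() == with_str_accessor().tolist()

runs = 20  # arbitrary repeat count chosen for this sketch
t_apply = timeit.timeit(with_apply, number=runs) / runs
t_str = timeit.timeit(with_str_accessor, number=runs) / runs

print(f"apply: {t_apply * 1e3:.3f} ms per call")
print(f"str:   {t_str * 1e3:.3f} ms per call")
```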

Answer 1 (score: 1)

You can try applying a lambda like this, for example with the same apply call that answer 0 times:

df['SOURCE_NAME'] = df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])

Hope this helps!
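One practical difference between an apply-based lambda and the .str accessor is how missing values are handled. The sketch below assumes a column that contains a None entry, which is not part of the question's data, and the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['Stack_Overflow_1234', None, 'Stackoverflow'])

# The .str accessor skips missing values and leaves them missing
via_str = s.str.rsplit('_', n=1).str.get(0)

# A bare lambda calls .rsplit on the missing value itself and fails
try:
    s.apply(lambda x: x.rsplit('_', 1)[0])
    apply_failed = False
except AttributeError:
    apply_failed = True

# Guarding the lambda keeps missing values untouched
via_apply = s.apply(lambda x: x.rsplit('_', 1)[0] if isinstance(x, str) else x)

print(via_str.isna().tolist())
print(apply_failed)
print(via_apply.isna().tolist())
```

If the column may contain missing values, the guarded lambda or the .str accessor is the safer choice.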

Answer 2 (score: 1)

Use rsplit() to get what you are after; you can tell it how many times to split the string.

s = "Stack_Over_Flow_1234"
s.rsplit('_', 1)[0] # Split my string one time and get the first part of it

This returns 'Stack_Over_Flow'
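A few edge cases are worth checking with the same call. The sample list below is made up for illustration and goes slightly beyond the question's data: a string without any underscore, one that ends with an underscore, and one that starts with an underscore.

```python
samples = ["Stack_Over_Flow_1234", "Stackoverflow", "Stack_", "_1234"]

# rsplit('_', 1) splits at most once, starting from the right;
# index 0 is everything before the last underscore
trimmed = [s.rsplit("_", 1)[0] for s in samples]

print(trimmed)  # ['Stack_Over_Flow', 'Stackoverflow', 'Stack', '']
```

A string with no underscore comes back unchanged, while a leading underscore leaves an empty string.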

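Answer 3 (score: 1)

You can use the string.split('_') function to split the string into a list of substrings around each underscore, then join them back together without the last element. Here is a snippet using your examples: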

a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]

for e in a:

    # Split the string into a list, separated at '_'
    splitStr = e.split("_")

    # If there is only 1 element, we can use it directly
    if len(splitStr) == 1:
        print(splitStr[0])

    # Slice off the final substring and join the remaining 
    # substrings back together with underscores
    else:
        print("_".join(splitStr[:-1]))
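The same split-and-rejoin logic can be wrapped in a small function so it returns values instead of printing them. This is only a sketch: the name drop_last_segment and its sep parameter are hypothetical, not part of the answer above.

```python
def drop_last_segment(text, sep="_"):
    """Return text without its final sep-delimited piece."""
    parts = text.split(sep)
    # No separator found: return the string unchanged
    if len(parts) == 1:
        return text
    # Drop the last piece and glue the rest back together
    return sep.join(parts[:-1])


a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]
result = [drop_last_segment(e) for e in a]

print(result)
# ['Stackoverflow', 'Stack_Over_Flow', 'Stackoverflow', 'Stack_Overflow']

# For these inputs it matches the shorter rsplit form from the other answers
assert result == [e.rsplit("_", 1)[0] for e in a]
```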