
Time: 2019-06-17 23:09:46

Tags: python pandas





import pandas as pd

testdf = pd.DataFrame([
    [ 'BACKGROUND\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n\nMETHODS\nData from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n\nRESULTS\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n\nDISCUSSION\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.' , 'Entry1'], 
                       [ '\nProblem statement: The industrialization of the world from whole to s ite as a result of technological innovation made many industries adopt ing Information and Communication Technology (ICT) for processing of all their activities from i nception to completion, especially in the developed nations. But, the developing nations appear to make sluggish progress towards ICT adoption due to apprehensiveness that their fraudulent activities c an easily be traced. \nApproach: The purpose of this study was to evaluate the contractor’s perception t oward ICT innovation acceptance for construction site management and the effectiveness of the innova tion. A 519 questionnaire survey was employed for the data collection, while SPSS version 17.0 wa s used for the descriptive statistic and factorial analysis of the data. \nResults: The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. \nConclusion: By evaluating the ICT innovation, empirical eviden c has been provided for the ‘wait and see contractors’ to adopt ICT in construction site management and by making adequate provisions against the negative factors. ' , 'Entry2'], 
                       ['BACKGROUND AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n\nMETHODS\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n\nRESULTS\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears.' ,  'Entry3']
] )
testdf.columns = ['A', 'B']


A   B
0   BACKGROUND\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n\nMETHODS\nData from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n\nRESULTS\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n\nDISCUSSION\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications. Entry1
1   \nProblem statement: The industrialization of the world from whole to s ite as a result of technological innovation made many industries adopt ing Information and Communication Technology (ICT) for processing of all their activities from i nception to completion, especially in the developed nations. But, the developing nations appear to make sluggish progress towards ICT adoption due to apprehensiveness that their fraudulent activities c an easily be traced. \nApproach: The purpose of this study was to evaluate the contractor’s perception t oward ICT innovation acceptance for construction site management and the effectiveness of the innova tion. A 519 questionnaire survey was employed for the data collection, while SPSS version 17.0 wa s used for the descriptive statistic and factorial analysis of the data. \nResults: The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. \nConclusion: By evaluating the ICT innovation, empirical eviden c has been provided for the ‘wait and see contractors’ to adopt ICT in construction site management and by making adequate provisions against the negative factors.\t Entry2
2   BACKGROUND AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n\nMETHODS\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n\nRESULTS\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears. Entry3


listStrings = { 
'\nIntroduction' , '\nCase' , 
'\nLiterature' , '\nBackground',  '\nRelated' , 
'\nMethods' , '\nMethod',
'\nTechniques', '\nMethodology',
'\nResults', '\nResult', '\nExperimental',
'\nExperiments', '\nExperiment',
'\nDiscussion' , '\nLimitations',
'\nConclusion' , '\nConclusions',
'\nConcluding' ,
'Introduction\n' , 'Case\n' , 
'Literature\n' , 'Background\n',  'Related\n' , 
'Methods\n' , 'Method\n',
'Techniques\n', 'Methodology\n',
'Results\n', 'Result\n', 'Experimental\n',
'Experiments\n', 'Experiment\n',
'Discussion\n' , 'Limitations\n',
'Conclusion\n' , 'Conclusions\n',
'Concluding\n' ,
'Introduction:' , 'Case:' , 
'Literature:' , 'Background:',  'Related:' , 
'Methods:' , 'Method:',
'Techniques:', 'Methodology:',
'Results:', 'Result:', 'Experimental:',
'Experiments:', 'Experiment:',
'Discussion:' , 'Limitations:',
'Conclusion:' , 'Conclusions:',
'Concluding:' ,
}
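
The collection above follows a regular scheme: each heading word appears with a leading newline, with a trailing newline, and with a trailing colon. As a minimal sketch, assuming that scheme is all that is intended, the same strings could be generated instead of typed by hand. The names `base_words` and `list_strings` are hypothetical and are not part of the original question.

```python
# Heading words taken from the hand-written collection above.
base_words = [
    'Introduction', 'Case', 'Literature', 'Background', 'Related',
    'Methods', 'Method', 'Techniques', 'Methodology',
    'Results', 'Result', 'Experimental', 'Experiments', 'Experiment',
    'Discussion', 'Limitations', 'Conclusion', 'Conclusions', 'Concluding',
]

# Build the three variants of every word: '\nWord', 'Word\n' and 'Word:'.
list_strings = {
    variant
    for word in base_words
    for variant in ('\n' + word, word + '\n', word + ':')
}

print(len(list_strings))
```

Generating the variants keeps the three groups in sync when a heading word is added or removed.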



testdf2 = pd.DataFrame([
    [ 'BACKGROUND' , '\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n', 'Entry1'],
    ['METHODS', 'Data from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n', 'Entry1'],
    ['RESULTS', '\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n', 'Entry1'],
    ['DISCUSSION', '\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.' , 'Entry1'], 
                        ['\nResults:', ' The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. ', 'Entry2'],
                         ['\nConclusion:',' By evaluating the ICT innovation, empirical eviden c has been provided for the wait and see contractors to adopt ICT in construction site management and by making adequate provisions against the negative factors.', 'Entry2'], 
                       ['BACKGROUND',  'AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n', 'Entry3'],
                      [ 'METHODS', '\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n',  'Entry3'],
                      [ 'RESULTS', '\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears.', 'Entry3']
])
testdf2.columns = ['C' , 'D', 'E']


C   D   E
0   BACKGROUND  \nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n    Entry1
1   METHODS Data from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n Entry1
2   RESULTS \nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n  Entry1
3   DISCUSSION  \nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.  Entry1
4   \nResults:  The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view.  Entry2
5   \nConclusion:   By evaluating the ICT innovation, empirical eviden c has been provided for the wait and see contractors to adopt ICT in construction site management and by making adequate provisions against the negative factors.    Entry2
6   BACKGROUND  AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n    Entry3
7   METHODS \nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n Entry3
8   RESULTS \nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears. Entry3



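Finally, the strings in listStrings may overlap. It does not matter which side of an overlap is matched first, which side is chosen for the column, or whether a combination (i.e. '\nResult:') is used, as shown in rows 4 and 5 of the second example DataFrame.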



1 answer:

Answer 0 (score: 1)


# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True) 
             for i in range(len(testdf))]).stack()

# stack the keywords:
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# build the returned dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})


                D                                                  E
0 0    BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...
  1       METHODS  \nData from 75 ALS patients and 75 healthy con...
  2        RESULT  S\nFollowing predictor variable selection, a c...
  3    DISCUSSION  \nThis study evaluates disease-associated imag...
  4           NaN                                                NaN
1 0     \nResults  : The findings show ICT innovation was effecti...
  1  \nConclusion  : By evaluating the ICT innovation, empirical ...
  2           NaN                                                NaN
2 0    BACKGROUND   AND PURPOSE\nRotator cuff tears are associate...
  1       METHODS  \nSupraspinatus muscle biopsies were obtained ...
  2        RESULT  S\nDegenerative changes were present in both p...
  3           NaN                                                NaN
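
The `RESULT` / `S...` split in the output above is consistent with how alternation behaves in Python's `re` module: alternatives are tried left to right, and the first one that matches wins, so a shorter keyword listed before a longer one that shares its prefix cuts the longer one short. A minimal sketch of the effect and of one possible workaround (sorting the alternatives longest first) follows. The `keywords` list and `text` string are hypothetical stand-ins, not the question's data.

```python
import re

# Hypothetical, shortened keyword list for illustration only.
keywords = ['RESULT', 'RESULTS', 'METHODS', 'BACKGROUND']

text = 'BACKGROUND\nfoo\n\nMETHODS\nbar\n\nRESULTS\nbaz'

# Naive alternation: 'RESULT' is tried before 'RESULTS' and wins.
naive = re.compile('|'.join(keywords))

# Longest-first alternation: 'RESULTS' is tried before 'RESULT'.
# re.escape guards against keywords that contain regex metacharacters.
ordered = re.compile(
    '|'.join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
)

naive_hits = naive.findall(text)
ordered_hits = ordered.findall(text)
print(naive_hits)    # ['BACKGROUND', 'METHODS', 'RESULT']
print(ordered_hits)  # ['BACKGROUND', 'METHODS', 'RESULTS']
```

Because `listStrings` is a set, the order of alternatives produced by `'|'.join(listStrings)` is not fixed, which is another reason to sort explicitly before joining.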



import numpy as np

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True) 
             for i in range(len(testdf))]).stack()

# stack the keywords:
keys = np.concatenate(new_df.values) # Flatten the keywords array
values = chunks.groupby(level=0).shift(-1).dropna().values
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)]) 
# build the returned dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'C': keys, 'D': values, 'E': labels})


C   D   E
0   BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...   Entry1
1   METHODS \nData from 75 ALS patients and 75 healthy con...   Entry1
2   RESULTS \nFollowing predictor variable selection, a cl...   Entry1
3   DISCUSSION  \nThis study evaluates disease-associated imag...   Entry1
4   \nResult    s: The findings show ICT innovation was effect...   Entry2
5   \nConclusion    : By evaluating the ICT innovation, empirical ...   Entry2
6   BACKGROUND  AND PURPOSE\nRotator cuff tears are associate...    Entry3
7   METHODS \nSupraspinatus muscle biopsies were obtained ...   Entry3
8   RESULTS \nDegenerative changes were present in both pa...   Entry3
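
Row 4 of the output above still shows a heading cut short (`\nResult` followed by `s:`). As a rough alternative sketch, not the answerer's method, the headings and their bodies can be separated in one pass with `re.split` and a capturing group, which keeps each matched heading in the result list. The shortened `headings` list, the `split_sections` helper and the `demo` frame are hypothetical. The sketch also assumes that a heading always starts a line and that stripping surrounding whitespace is acceptable.

```python
import re
import pandas as pd

# Hypothetical, shortened heading list; the real listStrings is much longer.
headings = ['Background', 'Methods', 'Result', 'Results', 'Discussion', 'Conclusion']

# Longest alternatives first, so 'Results' is tried before 'Result'.
alternation = '|'.join(re.escape(h) for h in sorted(headings, key=len, reverse=True))
# A heading must start a line; an optional colon is absorbed into it.
heading_re = re.compile(r'^((?:' + alternation + r'):?)', re.IGNORECASE | re.MULTILINE)

def split_sections(text, label):
    """Return (heading, body, label) tuples for one abstract."""
    parts = heading_re.split(text)
    # parts[0] is whatever precedes the first heading; after that the list
    # alternates heading, body, heading, body, ...
    return [(h.strip(), b.strip(), label) for h, b in zip(parts[1::2], parts[2::2])]

# Tiny stand-in for testdf with the same column names 'A' and 'B'.
demo = pd.DataFrame({
    'A': ['BACKGROUND\nfoo\n\nRESULTS\nbar', '\nResults: baz \nConclusion: qux'],
    'B': ['Entry1', 'Entry2'],
})

rows = [row
        for text, label in zip(demo['A'], demo['B'])
        for row in split_sections(text, label)]
out = pd.DataFrame(rows, columns=['C', 'D', 'E'])
print(out)
```

Anchoring on the start of a line avoids matching words such as "results" in the middle of a sentence, at the cost of missing headings that do not begin a line.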