How do I split a large pandas DataFrame column on a delimiter in its strings?

Asked: 2016-11-04 02:29:36

Tags: python list python-3.x pandas

I have a pandas DataFrame that I built by appending a series of lists. It consists mainly of strings that contain a delimiter ('\n'), as shown below:

   content

0   American Regent/Luitpold (Reverified 10/26/2016)\nCompany Contact Information:\n800-645-1706\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\n2 mL single-dose vial, package of 10 (NDC 00517-2502-10) Available for NDC 00517-2502-10. Demand increase for the drug
1   Amphastar Pharmaceuticals, Inc./IMS (Reverified 08/18/2016)\nCompany Contact Information:\n800-423-4136\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\nCalcium Chloride Inj. USP, 10%, 10mL Luer-Jet Prefilled Syringe, (NDC 0548-3304-00), new (NDC 76329-3304-1) Product available Demand increase for the drug\nHospira, Inc. (Reverified 10/21/2016)
2   American Regent/Luitpold (Reverified 10/26/2016)\nCompany Contact Information:\n800-645-1706\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\n10%, 50 mL vial; Calcium (0.465 mEq/mL), Preservative Free (NDC 0517-3950-25) Unavailable for NDC 00517-3950-25. No product available for release. No plan to manufacture. American Regent is currently not releasing Calcium Gluconate 50 mL vial (NDC 00517-3950-25). Other\n10%, 100 mL vial; Calcium (0.465 mEq/mL), Preservative Free (NDC 0517-3900-25) Unavailable for NDC 00517-3900-25. American Regent is currently not releasing Calcium Gluconate 100 mL vial (NDC 0517-3900-25). Other\nFresenius Kabi USA, LLC (Revised 11/01/2016)
 .......
n   Apotex Corp. (Revised 05/16/2016)\nCompany Contact Information:\n800-706-5575\n\nPresentation\n1gm; (25 Vials) (NDC 60505-0749-5)\n1gm; (25 Vials)(NDC 60505-6093-5)\n10 gm; (10 Vials) (NDC 60505-0769-0)\n10 gm; (10 Vials) (NDC 60505-6094-0)\nNote:\nAvailable\nB. Braun Medical Inc. (Revised 05/16/2016)\n\n\nBaxter Healthcare (Revised 05/16/2016)\n\n\nFresenius Kabi USA, LLC (Revised 05/16/2016)\n\n\nHospira, Inc. (Revised 05/16/2016)\n\n\nSagent Pharmaceuticals (Revised 05/16/2016)\n\n\nSandoz (Revised 05/16/2016)\n\n\nWest-Ward Pharmaceuticals (Revised 05/16/2016)\n\n\nWG Critical Care (Revised 05/16/2016)
n-1 Apotex Corp. (Reverified 10/26/2016)\nCompany Contact Information:\n800-706-5575\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\nCefepime for Injection, USP 1 gm (10 Vials) (NDC 60505-6030-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for Injection, USP 2 gm (10 Vials)(NDC 60505-6031-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 1 gm (10 Vials) (NDC 60605-0834-04) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 2 gm (10 Vials) (NDC 60505-0681-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 1 gm (1 Vial) (NDC 60505-0834-00) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 2 gm (10 Vials) (NDC 60505-0681-0) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nB. Braun Medical Inc. (New 07/22/2015)\n\n\nBaxter Healthcare (Reverified 10/25/2016)\n\n\nFresenius Kabi USA, LLC (Revised 11/01/2016)\n\n\nHospira, Inc. (Reverified 10/21/2016)\n\n\nSagent Pharmaceuticals (Revised 08/29/2016)\n\n\nWG Critical Care (Revised 06/08/2016)

How can I split the DataFrame content on the newline \n into multiple columns, like this:

   col1              col2        col3        col4
0  Shire US Inc. (Reverified 07/01/2016)   and so  on.... 
1  Hospira, Inc. (Reverified 10/21/2016)   and so  on....  
2  Mission Pharmacal (Reverified 01/21/2015)   and so  on....  
....
n  Mission Pharmacal (Reverified 01/21/2015)   and so  on....  

I tried:

df['col'] = df['content'].str.split('\n', expand = True)

Obviously, I get the error "Wrong number of items passed 45, placement implies 1". Also, I create the DataFrame like this:

df = pd.DataFrame(lis, columns = ['content'])

so I cannot use sep.

1 Answer:

Answer 0 (score: 1)

A similar question is here

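import pandas as pd

# Four identical sample rows, each containing '\n' separators.
df = pd.DataFrame(['The quick brown\n fox jumps \nover the \n lazy dog',
'The quick brown\n fox jumps \nover the \n lazy dog',
'The quick brown\n fox jumps \nover the \n lazy dog','The quick brown\n fox jumps \nover the \n lazy dog'], columns = ['data'])

# Split each string on '\n' and return the pieces (in reversed order) as a Series,
# so that apply() expands them into one column per piece.
foo = lambda x: pd.Series([i for i in reversed(x.split('\n'))])
rev = df['data'].apply(foo)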
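
For comparison, the str.split call from the question can also work once its result is kept as a whole DataFrame instead of being assigned to a single column. The following is only a minimal sketch under that assumption; the sample strings and the col1, col2, ... names are illustrative and not taken from the original data.

```python
import pandas as pd

# Illustrative data: rows with different numbers of '\n'-separated parts.
df = pd.DataFrame(
    {"content": ["Company A\nContact\n800-000-0000", "Company B\nContact"]}
)

# expand=True returns a DataFrame with one column per part; shorter rows
# are padded with missing values.
parts = df["content"].str.split("\n", expand=True)

# Give the new columns readable names (col1, col2, ...).
parts.columns = [f"col{i + 1}" for i in range(parts.shape[1])]

# Attach the parts to the original frame instead of assigning them to one column.
result = df.join(parts)
print(result)
```

Because the split produces several columns, assigning it to the single label df['col'] is what triggers the "Wrong number of items passed" error; joining the frames avoids that.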

Edit: following the discussion here, below is updated code that loads multiple files into a single DataFrame:

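allFiles_df = None
for it, currFile in enumerate(files):

    # Read each line of the file into a single 'data' column.
    df = pd.read_csv(currFile, sep = '\n', header = None)
    df.columns = ['data']

    # Split on the literal two-character text '\n' found in the raw data.
    splitFunc = lambda x: pd.Series([i for i in reversed(x.split('\\n'))])

    df = df['data'].apply(splitFunc)
    # Reshape to one row per piece, keeping the original row number.
    df = df.stack().to_frame().reset_index().drop(['level_1'],axis = 1)
    # Drop fragments of two characters or fewer.
    df = df[df[0].str.len() >2]
    df['fileNo'] = it

    # Append the rows of the current file to the combined frame.
    allFiles_df = pd.concat([allFiles_df,df])

allFiles_df.columns = ['rowNo','text','fileNo']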

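Key things to note: the '\n' is literal text in the raw data, so it is read into Python as '\\n'. The sep keyword in read_csv does not allow a multi-character separator, which is why you ran into the problem.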
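
The difference between a real newline and the literal two-character text can be checked in plain Python. This is a small illustrative sketch; the sample strings are made up.

```python
# A real newline character versus the two-character text backslash + n.
real_newline = "first\nsecond"
literal_text = "first\\nsecond"  # same value as r"first\nsecond"

# One character versus two characters.
print(len("\n"), len("\\n"))

# Splitting on a real newline leaves the literal text in one piece ...
print(literal_text.split("\n"))

# ... while splitting on the two-character sequence separates it.
print(literal_text.split("\\n"))
```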

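This outputs the file number and the row number where each string was found. It assumes that the files variable contains a list of file names with their paths.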
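
To see the shape of that output without any files on disk, here is a rough sketch that swaps in a different technique: in-memory lists stand in for the files, and DataFrame.explode (available in pandas 0.25 and later) replaces the stack step. The file_lines data and the all_files_df name are hypothetical, and unlike the code above the pieces are kept in their original order.

```python
import pandas as pd

# Hypothetical stand-in for two files: one list of raw lines per file.
# Each line contains the literal two-character text backslash + n.
file_lines = [
    ["Company A\\nContact\\n800-000-0000"],
    ["Company B\\nContact", "Company C\\nContact"],
]

frames = []
for file_no, lines in enumerate(file_lines):
    df = pd.DataFrame({"text": lines})
    # Remember which line of the file each piece came from.
    df["rowNo"] = df.index
    # Split on the literal text with Python's own str.split, which does not
    # interpret the pattern as a regular expression.
    df["text"] = df["text"].apply(lambda s: s.split("\\n"))
    # Give every piece its own row.
    df = df.explode("text")
    df["fileNo"] = file_no
    frames.append(df)

all_files_df = pd.concat(frames, ignore_index=True)[["rowNo", "text", "fileNo"]]
print(all_files_df)
```

Each output row then carries the text fragment together with the row and file it came from, which is the same information the stack-based version records.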