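Split a column and format the column values

Time: 2016-11-07 18:19:20

Tags: python csv pandas dataframe data-cleaning

I am trying to format the data in one column. I found ways to split the column, since its values are separated by a comma, but I cannot get the result formatted as shown in the expected output.

Input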

    TITLE,Issn
    NATURE REVIEWS MOLECULAR CELL BIOLOGY,"ISSN 14710072, 14710080"
    ANNUAL REVIEW OF IMMUNOLOGY,"ISSN 07320582, 15453278"
    NATURE REVIEWS GENETICS,"ISSN 14710056, 14710064"
    CA - A CANCER JOURNAL FOR CLINICIANS,"ISSN 15424863, 00079235"
    CELL,"ISSN 00928674, 10974172"
    ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS,"ISSN 15454282, 00664146"
    NATURE REVIEWS IMMUNOLOGY,"ISSN 14741741, 14741733"
    NATURE REVIEWS CANCER,ISSN 1474175X
    ANNUAL REVIEW OF BIOCHEMISTRY,"ISSN 15454509, 00664154"
    REVIEWS OF MODERN PHYSICS,"ISSN 00346861, 15390756"
    NATURE GENETICS,ISSN 10614036
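
  1. Split the Issn column into two columns, since its values contain a comma
  2. Remove only the word ISSN from the column
  3. Keep the digits and put a - after the first four of them

Expected output

    TITLE,Issn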
    NATURE REVIEWS MOLECULAR CELL BIOLOGY,1471-0072, 1471-0080
    ANNUAL REVIEW OF IMMUNOLOGY,0732-0582, 1545-3278
    NATURE REVIEWS GENETICS,1471-0056, 1471-0064
    CA - A CANCER JOURNAL FOR CLINICIANS,1542-4863, 0007-9235
    CELL,0092-8674, 1097-4172
    ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS,1545-4282, 0066-4146
    NATURE REVIEWS IMMUNOLOGY,1474-1741, 1474-1733
    NATURE REVIEWS CANCER, 1474-175X
    ANNUAL REVIEW OF BIOCHEMISTRY,1545-4509, 0066-4154
    REVIEWS OF MODERN PHYSICS,0034-6861, 1539-0756
    NATURE GENETICS,1061-4036
    

Any suggestions using pandas are appreciated. Thanks in advance.

Update

When I try to run either of the two programs given in the answers:

    import pandas as pd
    import re
    
    df = pd.read_csv('new_journal_list.csv', header='TITLE,Issn')
    
    '''
    df_split_num = df['Issn'].map(lambda x: x.split('ISSN ')[1].split(', '))
    df_dash_num = df_split_num.map(lambda x: [num[:4] + '-' + num[4:] for num in x])
    
    df_split_issn = pd.DataFrame(data=list(df_dash_num), columns=['Issn1', 'Issn2'])
    df[['Issn1', 'Issn2']] = df_split_issn
    del df['Issn']
    
    print df
    
    '''
    
    df[['Issn1','Issn2']] = (df.pop('Issn').str.extract('ISSN\s+([^,]+),?\s?(.*)', expand=True)
                       .apply(lambda x: x.str[:4]+'-'+x.str[4:]).replace(r'^-$', '', regex=True))
    
    print df
    

Running either one under the default Python 2.7 gives the following error:

    Traceback (most recent call last):
      File "clean_journal_list.py", line 1, in <module>
        import pandas as pd
      File "/usr/local/lib/python2.7/dist-packages/pandas/__init__.py", line 25, in <module>
        from pandas import hashtable, tslib, lib
      File "pandas/src/numpy.pxd", line 157, in init pandas.hashtable (pandas/hashtable.c:38364)
    

When run under Python 3.4, I get the error below:

    File "clean_journal_list.py", line 21
        print df
               ^
    SyntaxError: invalid syntax
    

3 Answers:

Answer 0 (score: 2)

IIUC, you can do this with the Series.str.extract(), apply() and replace() methods:

In [33]: df
Out[33]:
                                          TITLE                     Issn
0         NATURE REVIEWS MOLECULAR CELL BIOLOGY  ISSN 14710072, 14710080
1                   ANNUAL REVIEW OF IMMUNOLOGY  ISSN 07320582, 15453278
2                       NATURE REVIEWS GENETICS  ISSN 14710056, 14710064
3          CA - A CANCER JOURNAL FOR CLINICIANS  ISSN 15424863, 00079235
4                                          CELL  ISSN 00928674, 10974172
5   ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS  ISSN 15454282, 00664146
6                     NATURE REVIEWS IMMUNOLOGY  ISSN 14741741, 14741733
7                         NATURE REVIEWS CANCER            ISSN 1474175X
8                 ANNUAL REVIEW OF BIOCHEMISTRY  ISSN 15454509, 00664154
9                     REVIEWS OF MODERN PHYSICS  ISSN 00346861, 15390756
10                              NATURE GENETICS            ISSN 10614036

In [34]: df[['Issn1','Issn2']] = (df.pop('Issn')
    ...:                            .str.extract(r'ISSN\s+([^,]+),?\s?(.*)', expand=True)
    ...:                            .apply(lambda x: x.str[:4]+'-'+x.str[4:])
    ...:                            .replace(r'^-$', '', regex=True))
    ...:

In [35]: df
Out[35]:
                                          TITLE      Issn1      Issn2
0         NATURE REVIEWS MOLECULAR CELL BIOLOGY  1471-0072  1471-0080
1                   ANNUAL REVIEW OF IMMUNOLOGY  0732-0582  1545-3278
2                       NATURE REVIEWS GENETICS  1471-0056  1471-0064
3          CA - A CANCER JOURNAL FOR CLINICIANS  1542-4863  0007-9235
4                                          CELL  0092-8674  1097-4172
5   ANNUAL REVIEW OF ASTRONOMY AND ASTROPHYSICS  1545-4282  0066-4146
6                     NATURE REVIEWS IMMUNOLOGY  1474-1741  1474-1733
7                         NATURE REVIEWS CANCER  1474-175X
8                 ANNUAL REVIEW OF BIOCHEMISTRY  1545-4509  0066-4154
9                     REVIEWS OF MODERN PHYSICS  0034-6861  1539-0756
10                              NATURE GENETICS  1061-4036
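
For the Python 3 error shown in the question's update, here is a minimal sketch of the same approach as a standalone script. It assumes the data looks like the question's input (three rows are inlined through io.StringIO so it runs without a file), calls print() as a function, and lets read_csv take the column names from the first line, because `header` expects a row number rather than the column names. The names `csv_text` and `parts` are only illustrative.

```python
import io

import pandas as pd

# A few rows from the question, inlined so the sketch runs without a file.
csv_text = '''TITLE,Issn
CELL,"ISSN 00928674, 10974172"
NATURE REVIEWS CANCER,ISSN 1474175X
NATURE GENETICS,ISSN 10614036
'''

# By default the first line is used as the header row.
df = pd.read_csv(io.StringIO(csv_text))

# Capture the first ISSN and, if there is one, the second ISSN.
parts = df.pop('Issn').str.extract(r'ISSN\s+([^,]+),?\s?(.*)', expand=True)

# Put a dash after the fourth character; an empty capture turns into '-'
# and is blanked out again by replace().
parts = (parts.apply(lambda s: s.str[:4] + '-' + s.str[4:])
              .replace(r'^-$', '', regex=True))

# Name the two new columns and attach them to the remaining TITLE column.
parts.columns = ['Issn1', 'Issn2']
df = df.join(parts)

print(df)
```

With a real file, the io.StringIO(...) argument would be replaced by the file name, for example 'new_journal_list.csv' from the question.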

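Answer 1 (score: 1)

You will need to add some error handling and wrap this in a line-by-line loop, but here is the gist of it:

# `line` holds one row of input text
leader, issns = line.split(" ISSN ")
numbers = issns.split(", ")

# insert a dash after the first four characters of each number
print(leader, ', '.join([num[:4] + '-' + num[4:] for num in numbers]))

The key is to split each line into "the ISSN numbers" and "everything else", then separate the ISSN numbers from each other and reformat them.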
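
One possible way to add the loop and the error handling is sketched below. It assumes the input is the raw CSV from the question, where a comma and sometimes a quote sit in front of ISSN, so it reads rows with the standard csv module instead of splitting on " ISSN ". The helper name `format_issn_cell` and the inlined sample are illustrative, not part of the original answer.

```python
import csv
import io


def format_issn_cell(cell):
    """Turn 'ISSN 00928674, 10974172' into ['0092-8674', '1097-4172']."""
    prefix = "ISSN "
    if not cell.startswith(prefix):
        # Basic error handling: refuse cells that do not look as expected.
        raise ValueError("unexpected ISSN cell: %r" % cell)
    numbers = cell[len(prefix):].split(", ")
    return [num[:4] + "-" + num[4:] for num in numbers]


# Two rows from the question, inlined so the sketch runs without a file.
sample = io.StringIO(
    'TITLE,Issn\n'
    'CELL,"ISSN 00928674, 10974172"\n'
    'NATURE REVIEWS CANCER,ISSN 1474175X\n'
)

reader = csv.reader(sample)
header = next(reader)  # skip the 'TITLE,Issn' line

rows = []
for title, issn_cell in reader:
    rows.append([title] + format_issn_cell(issn_cell))

for row in rows:
    print(",".join(row))
```

The csv reader takes care of the quoted field, so the comma inside "ISSN 00928674, 10974172" does not split the row.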

Answer 2 (score: 1)

First, split the numbers and add the dashes to them, using the handy map function:

df_split_num = df['Issn'].map(lambda x: x.split('ISSN ')[1].split(', '))
df_dash_num = df_split_num.map(lambda x: [num[:4] + '-' + num[4:] for num in x])

Next, create a new DataFrame from the split ISSN numbers and put it back into the original DataFrame:

df_split_issn = pd.DataFrame(data=list(df_dash_num), columns=['Issn1', 'Issn2'])
df[['Issn1', 'Issn2']] = df_split_issn
del df['Issn']
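
Some rows in the question carry only one ISSN, so the mapped lists do not all have the same length. A small sketch of one way to make that explicit is to pad every list to two entries before building the new DataFrame. The function name `split_issn` and its `width` parameter are hypothetical, and the two sample cells are taken from the question.

```python
def split_issn(cell, width=2):
    """Split an 'ISSN ...' cell, add dashes, and pad to `width` entries."""
    numbers = cell.split('ISSN ')[1].split(', ')
    dashed = [num[:4] + '-' + num[4:] for num in numbers]
    # Pad with empty strings so every row ends up with the same length.
    return dashed + [''] * (width - len(dashed))


cells = ['ISSN 00928674, 10974172', 'ISSN 1474175X']
rows = [split_issn(cell) for cell in cells]
print(rows)  # [['0092-8674', '1097-4172'], ['1474-175X', '']]
```

The same function could be passed to df['Issn'].map in place of the two lambdas, which would give every row exactly two values for the Issn1 and Issn2 columns.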