Create a unique data frame by reading a file line by line

Time: 2017-10-19 09:34:22

Tags: python r pandas

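I have a tab-delimited file.txt that I need to turn into a data frame: basically, go through the file line by line and create one unique column per key in the final data frame. Where a row has no information for a column, write "Na" as the empty value. Note that "CS_" is the pattern that follows each ":". I was thinking of a pandas data frame, but any help would be much appreciated. Suggestions in R are welcome too.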

Input:

Japan        Cases:CS_1    People:CS_2    Life:CS_3
Australia    People:CS_4   Transportation:CS_Ground   
Spain        Life:CS_5     Language:CS_Spanish

Output:

             Cases     People    Life     Transportation     Language
Japan        CS_1      CS_2      CS_3     Na                 Na
Australia    Na        CS_4      Na       CS_Ground          Na
Spain        Na        Na        CS_5     Na                 CS_Spanish

2 answers:

Answer 0 (score: 0)

Given:

>>> from io import StringIO
>>> import pandas as pd
>>> infile = """Japan Cases:CS_1 People:CS_2 Life:CS_3
... Australia People:CS_4 Transportation:CS_Ground   
... Spain Life:CS_5 Language:CS_Spanish"""

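Iterate over the file line by line:

  1. Separate the first column from the rest of the line
  2. Split the rest on a suitable delimiter (e.g. spaces or tabs)
  3. Split each element into a key-value pair, where the key is the column header wanted in the final data frame
  4. Add the first-column value (the country name) under a temporary header name (e.g. Key)
  5. Store the dictionary in a list
  6. In code: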

    >>> row_dicts = []
    >>> for line in StringIO(infile):
    ...     k, _, therest = line.partition(' ')  # Step 1.
    ...     _row = {kv.split(':')[0]:kv.split(':')[1] for kv in therest.split()}  # Step 2-3. 
    ...     _row['Key'] = k  # Step 4. 
    ...     row_dicts.append(_row)  # Step 5.
    ... 
    

    Convert the list of dictionaries to a pd.DataFrame:

    >>> df = pd.DataFrame(row_dicts)
    >>> df
      Cases        Key    Language  Life People Transportation
    0  CS_1      Japan         NaN  CS_3   CS_2            NaN
    1   NaN  Australia         NaN   NaN   CS_4      CS_Ground
    2   NaN      Spain  CS_Spanish  CS_5    NaN            NaN
    

    Use .set_index to make the Key column (the country) the index:

    >>> df.set_index('Key')
              Cases    Language  Life People Transportation
    Key                                                    
    Japan      CS_1         NaN  CS_3   CS_2            NaN
    Australia   NaN         NaN   NaN   CS_4      CS_Ground
    Spain       NaN  CS_Spanish  CS_5    NaN            NaN
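
The steps above leave missing cells as NaN, whereas the question asks for "Na". A minimal sketch that wraps the same idea in a function and fills the gaps might look like the following. The function name parse_lines is hypothetical, and the sketch assumes that no country name or value contains whitespace, since it splits each line on any run of tabs or spaces.

```python
import io

import pandas as pd


def parse_lines(lines):
    """Build a DataFrame indexed by country from 'Country Key:Value ...' lines.

    parse_lines is a hypothetical helper name; it assumes no field contains whitespace.
    """
    rows = []
    for line in lines:
        fields = line.split()  # splits on any run of tabs or spaces
        if not fields:
            continue  # skip blank lines
        row = {"Key": fields[0]}  # first field is the country name
        for field in fields[1:]:
            key, _, value = field.partition(":")  # 'Cases:CS_1' -> 'Cases', 'CS_1'
            row[key] = value
        rows.append(row)
    # Missing keys become NaN in the frame; replace them with the string "Na".
    return pd.DataFrame(rows).set_index("Key").fillna("Na")


# Stand-in for the tab-delimited file from the question.
sample = io.StringIO(
    "Japan\tCases:CS_1\tPeople:CS_2\tLife:CS_3\n"
    "Australia\tPeople:CS_4\tTransportation:CS_Ground\n"
    "Spain\tLife:CS_5\tLanguage:CS_Spanish\n"
)
df = parse_lines(sample)
print(df)
```

Because an open file object also yields lines, the same function should work on the real file by passing it the handle returned by open("file.txt").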
    

Answer 1 (score: 0)

You could use the following (with a generator and comprehensions):

import re, pandas as pd

string = """
Japan        Cases:CS_1    People:CS_2    Life:CS_3
Australia    People:CS_4   Transportation:CS_Ground   
Spain        Life:CS_5     Language:CS_Spanish
"""

rx = re.compile(r'(?P<key>\w+):(?P<value>CS_\w+)')  # \w+ rather than \d+, so CS_Ground and CS_Spanish match too
rxc = re.compile(r'(?P<country>\w+)')

dft = (dict({'Country': item.group('country')}, **{m.group('key'): m.group('value') for m in rx.finditer(line)})
        for line in string.split("\n")
        for item in [rxc.match(line)]
        if item)

df = pd.DataFrame(dft)
print(df)

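This uses two regular expressions, one for the country and one for the key/value pairs, and then builds the df from the resulting dictionaries.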
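
To see what each expression contributes, here is a small sketch that applies both to a single input line. It assumes the value pattern is written as CS_\w+ so that non-numeric values such as CS_Ground are captured; the variable names line, country, pairs and record are only illustrative.

```python
import re

# Key/value pairs such as 'People:CS_4'; CS_\w+ is an assumption so that
# non-numeric values like CS_Ground are matched as well.
rx = re.compile(r'(?P<key>\w+):(?P<value>CS_\w+)')
# The country is the first run of word characters on the line.
rxc = re.compile(r'(?P<country>\w+)')

line = "Australia    People:CS_4   Transportation:CS_Ground"

country = rxc.match(line).group('country')
pairs = {m.group('key'): m.group('value') for m in rx.finditer(line)}

# Merge the country with the key/value pairs into one row dictionary.
record = dict({'Country': country}, **pairs)
print(record)
# {'Country': 'Australia', 'People': 'CS_4', 'Transportation': 'CS_Ground'}
```

Each line of the input produces one such dictionary, and pandas turns the sequence of dictionaries into rows, leaving NaN wherever a key is absent.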