How to remove unwanted text from a column in a pandas DataFrame

Asked: 2019-12-24 05:45:55

Tags: python-3.x pandas

This is my DataFrame:
                 Date         Value
0  "date": "1999-01-01  "s1":3.0000}
1  "date": "1999-01-02  "s1":3.0000}
2  "date": "1999-01-03  "s1":3.0000}
3  "date": "1999-01-04  "s1":3.0000}
4  "date": "1999-01-05  "s1":3.0000}

I want to transform this DataFrame so that it looks like this:

    Date             Value
    1999-01-01        3
    1999-01-02        3
    1999-01-03        3
    1999-01-04        3
    1999-01-05        3
    1999-01-06        3

I tried:

cols = ['Date', 'Value']
for col in cols:
    DataAll[col] = DataAll[col].map(lambda x: str(x).lstrip('{}').rstrip('"date:")({)(:)(s1)(})'))

If anyone has a solution for this, please help. I have spent a lot of time on it but have not found a clean solution.

2 answers:

Answer 0 (score: 2)

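You can chain the string methods: first strip the {} characters, then split on :, select the second element of each resulting list, and finally strip the surrounding " and spaces: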

cols = ['Date', 'Value']
f = lambda x: x.astype(str).str.strip('{}').str.split(':').str[1].str.strip(' "')
DataAll[cols] = DataAll[cols].apply(f)

print (DataAll)
         Date   Value
0  1999-01-01  3.0000
1  1999-01-02  3.0000
2  1999-01-03  3.0000
3  1999-01-04  3.0000
4  1999-01-05  3.0000
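
The cleaned columns above still hold strings. Below is a minimal sketch of the same chained approach followed by a dtype conversion. The sample frame is an assumption reconstructed from the question, and the conversion step is an addition, not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed raw strings).
DataAll = pd.DataFrame({
    'Date': ['"date": "1999-01-01', '"date": "1999-01-02', '"date": "1999-01-03'],
    'Value': ['"s1":3.0000}', '"s1":3.0000}', '"s1":3.0000}'],
})

cols = ['Date', 'Value']
# Strip braces, split on ':', keep the second piece, strip quotes and spaces.
clean = lambda s: s.astype(str).str.strip('{}').str.split(':').str[1].str.strip(' "')
DataAll[cols] = DataAll[cols].apply(clean)

# The cleaned values are still strings; convert them to proper dtypes.
DataAll['Date'] = pd.to_datetime(DataAll['Date'], format='%Y-%m-%d')
DataAll['Value'] = pd.to_numeric(DataAll['Value'])

print(DataAll.dtypes)
```

Note that splitting on : only works here because each value contains a single colon; a timestamp such as 12:30:00 in the data would need a different split.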

If the column contains JSON, first convert the values to dictionaries in a list comprehension, then pass the list to the DataFrame constructor:

print (DataAll)
                             json_col
0  {"date": "1999-01-01","s1":3.0000}
1  {"date": "1999-01-02","s1":3.0000}
2  {"date": "1999-01-03","s1":3.0000}
3  {"date": "1999-01-04","s1":3.0000}
4  {"date": "1999-01-05","s1":3.0000}

import ast
import pandas as pd

DataAll1 = pd.DataFrame([ast.literal_eval(x) for x in DataAll['json_col']])
print (DataAll1)
         date   s1
0  1999-01-01  3.0
1  1999-01-02  3.0
2  1999-01-03  3.0
3  1999-01-04  3.0
4  1999-01-05  3.0
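
As a possible alternative, when the strings are strictly valid JSON, the standard-library json module can parse them too. This is a sketch that assumes the column looks like the sample above. Unlike ast.literal_eval, json.loads also accepts JSON-only literals such as true, false and null.

```python
import json
import pandas as pd

# Assumed sample: each cell is one valid JSON object stored as a string.
DataAll = pd.DataFrame({'json_col': [
    '{"date": "1999-01-01","s1":3.0000}',
    '{"date": "1999-01-02","s1":3.0000}',
]})

# Parse every cell into a dict, then build a new frame from the dicts.
records = [json.loads(x) for x in DataAll['json_col']]
DataAll1 = pd.DataFrame(records)

# Optional: turn the date strings into real datetimes.
DataAll1['date'] = pd.to_datetime(DataAll1['date'])
print(DataAll1)
```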

Answer 1 (score: 1)

You can extract just the string between ':' and '.', as follows:

import pandas as pd

pan = pd.DataFrame({'date': ["1999-01-01", "1999-01-02","1999-01-03","1999-01-04","1999-01-05"], 'Value': ['"s1":3.0000', '"s1":3.0000', '"s1":3.0000', '"s1":3.0000', '"s1":3.0000']})

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

for index, row in pan.iterrows():
    print(row['date'],find_between(row['Value'], ':', '.'))

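The find_between function returns the substring of s that lies between first and last (here ':' and '.'), or an empty string if either delimiter is missing.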

The function is taken from the question "Find string between two substrings".
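
The row-by-row loop can also be written as a vectorized operation. The sketch below uses Series.str.extract with a regular expression that captures the digits between ':' and '.'. The Value_clean column name is hypothetical, and the pattern assumes the part before the decimal point is purely numeric.

```python
import pandas as pd

# Assumed sample mirroring the frame used in this answer.
pan = pd.DataFrame({
    'date': ['1999-01-01', '1999-01-02'],
    'Value': ['"s1":3.0000', '"s1":3.0000'],
})

# Capture the digits between ':' and '.'; rows without a match become NaN.
pan['Value_clean'] = pan['Value'].str.extract(r':(\d+)\.', expand=False)
print(pan)
```

Unlike find_between, which returns an empty string when a delimiter is missing, this version yields NaN for non-matching rows. The extracted values are strings and may still need converting to a numeric dtype.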