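Extracting identically named values in Python using Pandas

Time: 2018-06-26 06:18:29

Tags: python pandas csv dataframe extract

I am writing a Python program to extract multiple values from one column of a .csv file.

Here is my code: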

import csv
import pandas as pd

# read items with column name
df=pd.read_csv('D:\\My Documents\\Skype_Call_Session\\logs\\2018-06\\18\\skype_session_av.csv', header=0)

# extract values
df['FromIPAddr'] = df['QoEReport'].str.extract(r',"\FromIPAddr\":"\s*([^\.]*)\s*\","\ToIPAddr', expand=False)
df['ToIPAddr'] = df['QoEReport'].str.extract(r',"\ToIPAddr\":"\s*([^\.]*)\s*\","\FromBssid', expand=False)
df['Stream_1_PacketLossRate'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
df['Stream_1_RoundTrip'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
df['Stream_1_JitterInterArrival'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
df['Stream_2_PacketLossRate'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
df['Stream_2_RoundTrip'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
df['Stream_2_JitterInterArrival'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
df['OverallAvgNetworkMOS'] = df['QoEReport'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)

# OUTPUT TO NEW CSV
df.to_csv('D:\\My Documents\\Skype_Call_Session\\logs\\2018-06\\18\\extracted_av.csv', index=False, header=True)

So far the tests have gone well, but I am stuck on one problem: extracting two values whose surrounding characters are identical and storing them separately as Stream_1 and Stream_2, as shown in the code. df['QoEReport'].str.extract will not work properly this time.

Here is part of one cell from the QoEReport column that I want to extract from:

}],"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-TO"},{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01342051,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":2347573,"RoundTrip":721,"RoundTripMax":1703,"PacketUtilization":2906,"

For example, one cell contains two PacketLossRate values, and both sit between JitterInterArrivalMax and ,"PacketLossRateMax": so the surrounding text cannot tell them apart. I could distinguish them by their numbers, but there is no way to know the exact values in advance because they change every time.

Does anyone know how to solve this? Thanks a lot!

*************************************Update****************************************

Values of one column that I want to extract:

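1 Answer:

Answer 0 (score: 0)

The column value is JSON, so you can simply parse the JSON and look up the value by key.

Here is an example that extracts the PacketLossRate value from the JSON you shared: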

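import json  # each cell holds JSON text, so it must be parsed before it can be indexed by key
df['Stream_1_PacketLossRate'] = df['QoEReport'].apply(lambda s: json.loads(s)['AudioStreams'][0]['PacketLossRate'])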
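A fuller sketch of the same idea, covering both streams. It rests on assumptions the question does not confirm: that each QoEReport cell is a complete JSON document, and that AudioStreams is a top-level key holding one object per stream. The helper name stream_metric and the trimmed sample row are hypothetical, built only from values visible in the excerpt. Because the two streams are separate list elements, indexing by position tells the two PacketLossRate values apart with no reliance on the surrounding text.

```python
import json

import pandas as pd


def stream_metric(report, index, key):
    """Return `key` from the index-th AudioStreams entry, or None if unavailable.

    Assumes `report` is the full JSON text of one QoEReport cell with
    AudioStreams at the top level (an assumption, not confirmed by the question).
    """
    try:
        streams = json.loads(report).get("AudioStreams", [])
        return streams[index].get(key)
    except (TypeError, ValueError, IndexError, KeyError, AttributeError):
        # Missing cell, malformed JSON, or fewer streams than expected
        return None


# Hypothetical sample: one trimmed report plus one empty cell
sample = json.dumps({"AudioStreams": [
    {"JitterInterArrival": 10, "PacketLossRate": 0.01353227, "RoundTrip": 520},
    {"JitterInterArrival": 10, "PacketLossRate": 0.01342051, "RoundTrip": 721},
]})
df = pd.DataFrame({"QoEReport": [sample, None]})

# Stream_1_* comes from list position 0, Stream_2_* from position 1
for i in (0, 1):
    for key in ("PacketLossRate", "RoundTrip", "JitterInterArrival"):
        column = "Stream_{}_{}".format(i + 1, key)
        df[column] = df["QoEReport"].apply(
            lambda report, i=i, key=key: stream_metric(report, i, key)
        )
```

If AudioStreams turns out to be nested deeper in the real report, only the lookup inside stream_metric would need to change.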