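Splitting data in a pandas Series

Time: 2019-11-26 11:34:35

Tags: python pandas

I have a DataFrame built from API calls. I call the API 120 times and get a 1000x31 dataset back each time, which is appended on every call.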

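import json

import pandas as pd
import requests

# url_2, data_two and headers are assumed to be defined elsewhere.
def load_full2(times):
    dfs = []
    item_count = 0
    # Strict "<" so the API is called exactly `times` times.
    while item_count < times:
        response = requests.post(url_2, data=json.dumps(data_two), headers=headers)
        response_json = response.json()
        # Top-level pd.json_normalize (pandas 1.0+) flattens the nested hits.
        result = pd.json_normalize(response_json['hits']['hits'])
        item_count += 1
        dfs.append(result)

    df = pd.concat(dfs, ignore_index=True)
    df.to_csv("export2.csv", encoding='utf-8', index=False)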

My final exported dataset looks like this:

120000x31

id    _index    _score     _source.agent    _source.cookie                                                                                                                                  .source.id    _source.log    _source.keys    _source.name    _source.category    _source.class    _source.companyid    _source.cname    _source.ip    _source.method    _source.process    _source.skid    _source.severity    _source.sysname    _source.template    _source.time    _source.country    _source.event    _source.hostname    _source.ipip    _source.namespace    _source.refer    _source.request_url    _source.type
n/a    n/a      n/a        n/a              __cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=google.com       n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a           n/a       https://google.com/au/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE

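My main interest is in the "_source.cookie" and "_source.request_url" columns. My goal is to add 2 new columns to my dataset. The first is gclid_from_cookie, which will hold the value that follows gclid= and ends at ; . The second is gclid_from_url, which will hold the value that follows either gclid= or click_id= in the URL.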

The output I want looks like this:

120000x33

id    ...    _source.cookie    ...    _source.request_url    gclid_from_cookie    gclid_from_url
1    ...    c1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_pat    ...    pn/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE    EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE    CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE
2    ...    c1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_pat    ...    to/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_    EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE    CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE
...

I'm fairly new to programming and not sure how to proceed or how to code this. Should I split the strings from the 2 columns I'm interested in inside the while loop, as each chunk of the file is built? Or should I do it after the full file has been compiled?

My second problem is that in the _source.request_url column the value is set under either gclid= or click_id=. So I'm not sure how to split the string when the value may appear in either of these forms or not at all.

When I try to split the string, I get an error.

Any help is much appreciated.

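2 answers:

Answer 0: (score: 0)

  1. It is better to do this after the DataFrame has been created.
  2. You can't call Python string methods directly on a pd.Series; go through its .str accessor instead: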

    df['str_col'].str.split(':')
    

For example:
Suppose you have a DataFrame like this:

import pandas as pd

data = {'Name':['Tom:bar', 'nick:bar', 'krish:bar', 'jack:bar'], 'Age':[20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)
print(df)
[Out]:
        Name   Age
0    Tom:bar   20
1   nick:bar   21
2  krish:bar   19
3   jack:bar   18

You can create a new column with the following operation:

df['bar_col'] = [x.split(':')[1] for x in df.Name]
print(df)
[Out]:
        Name  Age  bar_col
0    Tom:bar   20  bar
1   nick:bar   21  bar
2  krish:bar   19  bar
3   jack:bar   18  bar
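
The list comprehension above raises an error as soon as one value in the column is missing, or has no separator. A minimal sketch of the same split done through the .str accessor, which propagates missing values instead of failing; the sample data (including the None entry) is made up for illustration:

```python
import pandas as pd

# Hypothetical sample: one value is missing and one has no ':' separator.
df = pd.DataFrame({'Name': ['Tom:bar', None, 'krish', 'jack:baz']})

# .str.split returns a list per row; .str[1] takes the second piece and
# yields a missing value where the input is missing or has fewer than two pieces.
df['bar_col'] = df['Name'].str.split(':').str[1]

print(df)
```

Rows without a usable value end up as NaN in bar_col, so they can be filtered or filled afterwards rather than stopping the whole run.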

Answer 1: (score: 0)

The DataFrame is still hard to read, but using the following example:

df = pd.DataFrame({'_source.request_url': ['https://google.com/au/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE', 'https://google.com/au/?click_id=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE', 'no match example'], 
                   '_source.cookie': ['__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;', '__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;', None]})

To extract the string between = and ; you can use the regular expression pattern r'=(.+?);'
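
A quick check of how that pattern behaves may help here. On its own it captures the value of whichever key comes first in the string, so the key name has to be part of the pattern to target gclid. The cookie string below is shortened and made up, following the key=value; layout from the question:

```python
import re

# Shortened, made-up cookie string in the same layout as the question's data.
cookie = '__cfduid=abc123; gclid=XYZ789; full_path=google.com'

# The bare pattern stops at the first '=' ... ';' pair, i.e. the __cfduid value.
first_value = re.search(r'=(.+?);', cookie).group(1)

# Prefixing the key name anchors the match to the gclid entry.
gclid_value = re.search(r'gclid=(.+?);', cookie).group(1)

print(first_value)  # abc123
print(gclid_value)  # XYZ789
```

The helper that follows wraps the same idea in a function that also tolerates missing values.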

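import re

def get_glid_from_source(pattern, data):
    # Return the first capture group, or None when nothing matches.
    # str() lets missing values (None/NaN) be searched without raising.
    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None

# (?:gclid|click_id) is an alternation; square brackets would instead define a
# character class. [^&#]+ stops the capture at the next URL parameter.
df['glid_from_url'] = df['_source.request_url'].apply(
    lambda x: get_glid_from_source(r'(?:gclid|click_id)=([^&#]+)', x))
# [^;%&]+ also matches when gclid is the last entry and has no trailing ';'.
df['gclid_from_cookie'] = df['_source.cookie'].apply(
    lambda x: get_glid_from_source(r'gclid=([^;%&]+)', x))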

If there is no match in the data, re.search returns None, so you have to catch that case with if result is not None.

The output DataFrame is:

    _source.request_url                                 _source.cookie                                      glid_from_url                                       gclid_from_cookie
0   https://google.com/au/?gclid=CjwKCAiAlO7uBRANE...   __cfduid=d118f225fac35345d9e1d87e533b596ec1574...   CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lH...   EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEg...
1   https://google.com/au/?click_id=CjwKCAiAlO7uBR...   __cfduid=d118f225fac35345d9e1d87e533b596ec1574...   CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lH...   EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEg...
2   no match example                                    None                                                None                                                None

This works if there is only one match in the data. If there are several matches and you want to capture all of them, use re.findall(pattern, data)
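
A small sketch of the re.findall variant, assuming a made-up URL that carries both parameters. With a single capture group, findall returns a list of the captured strings, and an empty list when nothing matches, so no None check is needed:

```python
import re

# Hypothetical URL containing both parameters, for illustration only.
url = 'https://example.com/?gclid=AAA111&utm=x&click_id=BBB222'

# (?:...) groups the alternation without capturing, so findall returns
# only the values captured by ([^&#]+).
pattern = r'(?:gclid|click_id)=([^&#]+)'

all_ids = re.findall(pattern, url)
no_ids = re.findall(pattern, 'no match example')

print(all_ids)  # ['AAA111', 'BBB222']
print(no_ids)   # []
```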