I have a DataFrame that is built from API calls. I call the API 120 times, and each call returns a 1000x31 dataset that is appended to the result.
import json

import pandas as pd
import requests

def load_full2(times):
    # url_2, data_two and headers are assumed to be defined elsewhere
    dfs = []
    for _ in range(times):  # exactly `times` calls; `while item_count <= times` made one extra
        response = requests.post(url_2, data=json.dumps(data_two), headers=headers)
        response_json = response.json()
        # flatten the nested hits into one row per hit
        dfs.append(pd.json_normalize(response_json['hits']['hits']))
    df = pd.concat(dfs, ignore_index=True)
    df.to_csv("export2.csv", encoding='utf-8', index=False)
The final dataset I export looks like this:
120000x31
id _index _score _source.agent _source.cookie .source.id _source.log _source.keys _source.name _source.category _source.class _source.companyid _source.cname _source.ip _source.method _source.process _source.skid _source.severity _source.sysname _source.template _source.time _source.country _source.event _source.hostname _source.ipip _source.namespace _source.refer _source.request_url _source.type
n/a n/a n/a n/a __cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=google.com n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a https://google.com/au/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
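My main interest is in the "_source.cookie" and "_source.request_url" columns. My goal is to add 2 new columns to my dataset. The first is gclid_from_cookie, which will hold the value that follows gclid= and ends at the next ; character. The second is gclid_from_url, which will hold the value that follows either gclid= or click_id=.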
My desired output would look like this:
120000x33
id ... _source.cookie ... _source.request_url gclid_from_cookie gclid_from_url
1 ... c1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_pat ... pn/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE
2 ... c1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_pat ... to/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_ EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE
...
I am fairly new to programming and not sure how to go about coding this. Should I do the split inside the loop, each time a chunk of the file is built, on the 2 columns I am interested in? Or should I do it after the full file has been compiled?
The second problem is that in the _source.request_url column the value is set under either gclid= or click_id=. So I am not sure how to split the string when the value may sit under either of these keys or may not be present at all.
When I try to split the strings, I get an error.
Any help would be greatly appreciated.
Answer 0: (score: 0)
You cannot call string methods directly on a pd.Series; you have to go through its .str accessor instead:
df['str_col'].str.split(':')
For example, suppose you have a DataFrame like this:
import pandas as pd

data = {'Name':['Tom:bar', 'nick:bar', 'krish:bar', 'jack:bar'], 'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
[Out]:
        Name  Age
0    Tom:bar   20
1   nick:bar   21
2  krish:bar   19
3   jack:bar   18
You can then create a new column with the following operation:
df['bar_col'] = [x.split(':')[1] for x in df.Name]
print(df)
[Out]:
        Name  Age bar_col
0    Tom:bar   20     bar
1   nick:bar   21     bar
2  krish:bar   19     bar
3   jack:bar   18     bar
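As a vectorized alternative to the list comprehension above, Series.str.extract applies a regular expression to every element and yields NaN where nothing matches, so missing or non-matching cells do not raise an error. This is only a minimal sketch: the sample rows and both patterns are assumptions modeled on the question's data, not something checked against the real export.

```python
import pandas as pd

# Hypothetical sample rows modeled on the question's two columns.
df = pd.DataFrame({
    "_source.cookie": [
        "__cfduid=abc123; gclid=COOKIE_VALUE; full_path=google.com",
        "__cfduid=abc123; full_path=google.com",  # no gclid present
        None,                                     # missing cookie
    ],
    "_source.request_url": [
        "https://google.com/au/?gclid=URL_VALUE_1",
        "https://google.com/to/?click_type=gclid&click_id=URL_VALUE_2&x=1",
        "https://google.com/au/",
    ],
})

# Capture everything after "gclid=" up to the next ";" (or end of string).
df["gclid_from_cookie"] = df["_source.cookie"].str.extract(
    r"gclid=([^;]+)", expand=False
)

# Capture the value of a "gclid" or "click_id" query parameter,
# stopping at the next "&" or "#".
df["gclid_from_url"] = df["_source.request_url"].str.extract(
    r"[?&](?:gclid|click_id)=([^&#]+)", expand=False
)

print(df[["gclid_from_cookie", "gclid_from_url"]])
```

Running this once on the concatenated DataFrame, just before to_csv, is probably simpler than splitting inside the download loop, since the extraction does not depend on which API call a row came from.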
Answer 1: (score: 0)
The DataFrame in the question is still hard to read, but take the following example:
df = pd.DataFrame({'_source.request_url': ['https://google.com/au/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE', 'https://google.com/au/?click_id=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE', 'no match example'],
'_source.cookie': ['__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;', '__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE;', None]})
To extract the string between = and ;, you can use the regular expression pattern r'=(.+?);'.
import re

def get_glid_from_source(pattern, data):
    # str() turns None/NaN into text that simply fails to match
    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None

# (?:gclid|click_id) is an alternation; [gclid|click_id] would be a character class
df['glid_from_url'] = df.apply(lambda x: get_glid_from_source(r'[?&](?:gclid|click_id)=([^&#]+)', x['_source.request_url']), axis=1)
df['gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source(r'gclid=(.+?)[;%&]', x['_source.cookie']), axis=1)
If there is no match in the data, re.search returns None, so you have to guard against that with if result is not None.
The output DataFrame is:
_source.request_url _source.cookie glid_from_url gclid_from_cookie
0 https://google.com/au/?gclid=CjwKCAiAlO7uBRANE... __cfduid=d118f225fac35345d9e1d87e533b596ec1574... CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lH... EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEg...
1 https://google.com/au/?click_id=CjwKCAiAlO7uBR... __cfduid=d118f225fac35345d9e1d87e533b596ec1574... CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lH... EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEg...
2 no match example None None None
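Because the value sits in the URL's query string, the standard library's urllib.parse can also pull it out without a regular expression. The following is a minimal sketch, assuming the identifier is always carried by a gclid or click_id query parameter; extract_click_id is a hypothetical helper name, not part of the answer above.

```python
from urllib.parse import urlparse, parse_qs

def extract_click_id(url, keys=("gclid", "click_id")):
    """Return the first value found under any of `keys`, or None."""
    if not isinstance(url, str):
        return None  # None/NaN cells are passed through as None
    # parse_qs maps each query parameter name to a list of its values
    params = parse_qs(urlparse(url).query)
    for key in keys:
        if key in params:
            return params[key][0]
    return None

print(extract_click_id("https://google.com/au/?gclid=AAA"))
print(extract_click_id("https://google.com/to/?click_type=gclid&click_id=BBB&x=1"))
print(extract_click_id("no match example"))
```

It could be applied with df['_source.request_url'].apply(extract_click_id). Note that parse_qs percent-decodes values, so the result may differ from the raw text if the identifier was URL-encoded.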
This works when the data contains only one match. If there can be several matches and you want to capture all of them, use re.findall(pattern, data).
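To illustrate the difference, here is a small sketch comparing re.search and re.findall on a made-up cookie string that carries two gclid entries (the string is an assumption, not taken from the question's data):

```python
import re

# Hypothetical cookie string with two gclid entries.
cookie = "gclid=FIRST; other=1; gclid=SECOND;"
pattern = r"gclid=(.+?)[;%&]"

# re.search stops at the first match.
first = re.search(pattern, cookie)
print(first.group(1))

# With a single capture group, re.findall returns a list of the captured strings.
all_matches = re.findall(pattern, cookie)
print(all_matches)

# When nothing matches, re.findall returns an empty list rather than None.
no_matches = re.findall(pattern, "no match example")
print(no_matches)
```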