Cleaning raw data in R

Time: 2016-09-08 00:00:51

Tags: r ndjson

I have a text file of raw, low-level online log data. I need to organize this raw data and export the organized result to a .csv file.

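Sample raw data is shown below. In the data, eventid is a column name and "0f3f98c7-1cee-4c1a-bad9-c5d772c9219b" is its field value. In the same way, visitorid is a column name and "015469816482e00095002e08d007f0" is its field value. So column names and field values are separated by : (colon), and columns are separated by , (comma). Each record starts with an opening brace { and ends with a closing brace }. Each row/record contains between 120 and 167 columns, and some columns may hold empty values. I would like to write a program that organizes/cleans the data in the .txt file and writes it to a .csv file. Any ideas and support would be highly appreciated.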

{ "eventid" : "0f3f98c7-1cee-4c1a-bad9-c5d772c9219b" , "visitorid" : "015469816482e00095002e08d007f0" , "eventtime" : 1462059242000 , "useragent" : "Mozilla/5.0 (Linux; U; Android 4.2.2; ca-ca; SonySO-04E Build/10.3.1.B.0.224) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" , "pageurl_full_url" : "https://www.abcdefg.com/auto/v22/renewal/calculatePremium.html" , "pageurl_scheme" : "https" , "pageurl_domain" : "www.abcdefg.com" , "pageurl_path" : "/auto/abc22/renewal/calculatePremium.html" , "referrerurl_full_url" : "https://www.abcdefg.com.jp/auto/v22/renewal/calculatePremium.html" , "referrerurl_scheme" : "https" , "abcdefg_domain" : "www.abcdefg.com" , "referrerurl_path" : "/auto/abc22/renewal/calculatePremium.html" , "tags_main_4_executed" : true , "tags_main_16_executed" : true , "dom_title" : "super Car Ins [quote results For Renew]" , "dom_referrer" : "" , "dom_hash" : "" , "dom_domain" : "www.abcdefg.com" , "dom_viewport_width" : 720 , "dom_viewport_height" : 1030 , "dom_pathname" : "/auto/abc22/renewal/calculatePremium.html" , "dom_query_string" : "" , "dom_url" : "https://www.abcdefg.com/auto/abc22/renewal/calculatePremium.html" , "udo_quote_date" : "2016.5.1" , "udo_page_url" : "https://www.abcdefg.com/auto/abc22/renewal/calculatePremium.html" , "udo_ut_version" : "ut4.38.201604270626" , "udo_prod_id" : "ACD" , "udo_quote_expiry_date" : "2017.05.14" , "udo_quote_prev_expiry_date" : "2017.05.14" , "udo_ut_account" : "abcdefg-india" , "udo_page_cat" : "Product" , "udo_contract_paytype" : "" , "udo_quote_amt" : "71290,71690,72080" , "udo_quote_id" : "175545859609000,175545859609000,175545859609000" , "udo_client_id" : "911324977090000" , "udo_renewal_times" : "4" , "udo_prod_name" : "Renewal" , "udo_ut_profile" : "main" , "udo_device_id" : "Mobile" , "udo_contract_id" : "26771063" , "udo_ut_event" : "view" , "udo_ut_env" : "prod" , "udo__t_session_id" : "1462058968141" , "udo__t_visitor_id" : 
"01546981644d001e0f99d341182e00095002e08d007f0" , "udo_page_type" : "Quote" , "udo_ut_domain" : "abcdefg.com" , "udo_contract_amt" : "" , "udo_page_name" : "super car insurance quote [quote results For Renew]" , "udo_quote_pre_ins_company" : "abcInsCompany" , "js_timestamp" : "2016-04-30T23:34:02.612Z" , "firstpartycookies_utag_main_dc_event" : "15" , "firstpartycookies_utag_main__ss" : "0" , "firstpartycookies_sc_status" : "8" , "firstpartycookies_utag_quote_date" : "2016.5.1" , "firstpartycookies_utag_contract_id" : "26771063" , "firstpartycookies_utag_main__st" : "1462061042574" , "firstpartycookies_utag_main_dc_visit" : "1" , "firstpartycookies_utag_page_cat" : "Product" , "firstpartycookies_utag_prod_id" : "ACD" , "firstpartycookies_utag_renewal_times" : "4" , "firstpartycookies_utag_quote_expiry_date" : "2017.05.14" , "firstpartycookies_utag_page_type" : "Quote" , "firstpartycookies_uniqueid" : "954704708970000" , "firstpartycookies_sc_asp_net_sessionid" : "ih3l00llymb2ml4dwzzywdrv" , "firstpartycookies__gat_tealium_0" : "1" , "firstpartycookies_utag_main_v_id" : "01546981644d001e0f99d341182e00095002e08d007f0" , "firstpartycookies_session" : "MTAuNjguMS4zMg==" , "firstpartycookies_utag_prod_name" : "Renewal" , "firstpartycookies_utag_quote_id" : "175545859609000,175545859609000,175545859609000" , "firstpartycookies_utag_main_ses_id" : "1462058968141" , "firstpartycookies__ga" : "GA1.3.987054064.1462058970" , "firstpartycookies_utag_main__sn" : "1" , "firstpartycookies___gyr_casted_frames" : "L_E_2170_A39,L_E_1985_A39,L_E_1881_A39" , "firstpartycookies_jsessionid" : "0001oiKCaYAKZYSnX6JqmhIrOo0:15fkukn73" , "firstpartycookies_utag_main__pn" : "8" , "firstpartycookies_utag_quote_amt" : "71290,71690,72080" , "firstpartycookies___gyr_uuid" : "7e4f3633-0eff-4e0e-98b6-c8c544cbb375" , "firstpartycookies___gyr_sid" : "2ce7772f-0c95-455f-afc4-241a5541c47c" , "firstpartycookies_utag_quote_prev_expiry_date" : "2017.05.14" , "firstpartycookies_utag_client_id" : 
"911324977090000" , "firstpartycookies_utag_quote_pre_ins_company" : "abcInsuranceCompany" , "firstpartycookies_utag_device_id" : "Mobile" , "firstpartycookies___gyr_cmpcnts" : "L_E_2170_A39:[825:1],L_E_1985_A39:[1011:1],L_E_1881_A39:[1022:1]" , "firstpartycookies___gyr_depid" : "14326,17172,17292" , "firstpartycookies___gyr_rule_id_myabcdefg" : "1079"}
{ "eventid" : "f8c8beac-d8ce-4930-956e-79c6120aea65" , "visitorid" : "0154698511eb0019161e632df605020a9007a0a100bd0" , "eventtime" : 1462059246000 , "useragent" : "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko" , "pageurl_full_url" : "https://www.abcdefg.com/" , "pageurl_scheme" : "https" , "pageurl_domain" : "www.abcdefg.com" , "pageurl_path" : "/" , "referrerurl_full_url" : "https://www.abcdefg.com/" , "referrerurl_scheme" : "https" , "referrerurl_domain" : "www.abcdefg.com" , "referrerurl_path" : "/" , "tags_main_4_executed" : true , "tags_main_15_executed" : true , "tags_main_16_executed" : true , "tags_main_61_executed" : true , "dom_title" : "【Hoken】Adv Site|Auto Insurance・care of" , "dom_referrer" : "http://search.abcdefg.com/search;_ylt=A3xTqFmsQCVXu08AhwiJBtF7?p=%E3%83%81%E3%83%A5%E3%83%BC%E3%83%AA%E3%83%83%E3%83%92&search.x=1&fr=top_ga1_sa&tid=top_ga1_sa&ei=UTF-8&aq=0&oq=%E3%81%A1%E3%82%85%E3%81%86%E3%82%8A%E3%81%A3%E3%81%B2&afs=" , "dom_hash" : "" , "dom_domain" : "www.abcdefg.com" , "dom_viewport_width" : 1912 , "dom_viewport_height" : 955 ,

1 answer:

Answer 0 (score: 2)

The ndjson package can handle this. I made a file from three of the records; where a column is missing from a row, it is filled with NA:

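library(dplyr)   # provides glimpse()
library(ndjson)

# Read the newline-delimited JSON file into one flat table (one row per record)
glimpse(ndjson::stream_in("so.json"))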

I can't show the output because Stack Overflow isn't smart enough to avoid flagging it as spam.
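The question also asks for a .csv export, which the snippet above stops short of. A minimal sketch of that last step follows, assuming each record sits on its own line in the input file. The two-record sample and the temporary file paths are made up for illustration and are not the asker's data.

```r
library(ndjson)

# Hypothetical two-record sample; the second record has no "dom_title" key
path <- tempfile(fileext = ".json")
writeLines(c(
  '{ "eventid" : "a1" , "eventtime" : 1462059242000 , "dom_title" : "quote" }',
  '{ "eventid" : "b2" , "eventtime" : 1462059246000 }'
), path)

# One row per record, one column per key seen anywhere in the file
records <- as.data.frame(ndjson::stream_in(path))

# Write the table out; missing values become empty cells instead of "NA"
out <- tempfile(fileext = ".csv")
write.csv(records, out, row.names = FALSE, na = "")
```

With the real log file, `path` would point at the .txt file and `out` at the desired .csv location.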

You can also use the slower jsonlite::stream_in(), but you will have to flatten the records yourself.
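As a rough sketch of what that flattening could look like: the records in the question appear to be flat already, so this only matters if some records contain nested objects. The nested "page" field and the file path below are hypothetical, chosen just to show the effect of jsonlite::flatten().

```r
library(jsonlite)

# Hypothetical sample with a nested "page" object; not taken from the question's data
path <- tempfile(fileext = ".json")
writeLines(c(
  '{ "eventid" : "a1" , "page" : { "scheme" : "https" , "path" : "/" } }',
  '{ "eventid" : "b2" , "page" : { "scheme" : "https" } }'
), path)

# stream_in() reads one JSON record per line; nested objects become data-frame columns
nested <- jsonlite::stream_in(file(path), verbose = FALSE)

# flatten() expands nested data-frame columns into ordinary columns named parent.child
flat <- jsonlite::flatten(nested)
names(flat)
```

The flattened data frame can then be passed to write.csv() like any other.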