I have a large CSV file (1.7 GB - roughly 4 million rows, I believe). The file is a dump from a Cisco IronPort of all traffic for a certain range. My end goal is to import the text into SQL/Access or one of the data-modeling applications so I can show the browsing habits of the unique IDs in the file (actually two files).
When I import it into SQL, the import blows up because one of the URLs contains a comma. My idea is to rewrite the URL column to drop everything after the TLD (foo.com/blah,tracking?ref=!superuselessstuff becomes foo.com).
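For example, the transform I have in mind would do something like this on a single value (just an illustration I threw together; the value and regex here are made up and are not from the scripts below):
# one-off illustration of the intended rewrite (hypothetical value)
'foo.com/blah,tracking?ref=!superuselessstuff' -replace '^([^/]+)/.*$','$1'
# outputs: foo.com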
A colleague came up with the following two PowerShell snippets. The first works great, but the 1.7 GB file drags my system to a crawl and it never finished (48 hours of running without completing). The second finishes, but produces text that is much harder to work with. Help?
Sample source data:
"Begin Date"|"End Date"|"Time (GMT -05:00)"|"URL"|"CONTENT TYPE"|"URL CATEGORY"|"DESTINATION IP"|"Disposition"|"Policy Name"|"Policy Type"|"Application Type"|"User"|"User Type"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728377"|"hxxp://mediadownloads.mlb.com/mlbam/2013/06/23/mlbtv_bosdet_28278793_1800K.mp4"|"video/mp4"|"Sports and Recreation"|"165.254.94.168"|"Allow"|"Generics"|"Access"|"Media"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728376"|"hxxp://stats.pandora.com/v1?callback=jQuery17102006296486278092_1374683921429&type=promo_box&action=auto_scroll&source=PromoBoxView&listener_id=84313100&_=1374728377192"|"text/javascript"|"Streaming Audio"|"208.85.40.44"|"Allow"|"Generics"|"Access"|"Media"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728357"|"hxxp://b.scorecardresearch.com/p?c1=1&c2=3005352&c3=&c4=mlb&c5=02&c6=&c10=&c7=hxxp%3A//wapc.mlb.com/det/play/%3Fcontent_id%3D29083605%26topic_id%3D8878748%26c_id%3Ddet&c8=Video%3A%20Recap%3A%20BOS%203%2C%20DET%2010%20%7C%20MLB.com%20Multimedia&c9=hxxp%3A//detroit.tigers.mlb.com/index.jsp%3Fc_id%3Ddet&rn=0.36919005215168&cv=2.0"|"image/gif"|"Business and Industry"|"207.152.125.91"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"hxxp://lt150.tritondigital.com/lt?guid=VEQyNX4wMmIzY2FmZi1mMmExLTQ5OWQtODM5NS1kMjE0ZTkwMzMyMTY%3D&yob=1978&gender=M&zip=55421&hasads=0&devcat=WEB&devtype=WEB&cb=13747283558794766"|"text/plain"|"Business and Industry"|"208.92.52.90"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\GEN1@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"""hxxp://an.mlb.com/b/ss/mlbglobal08,mlbtigers/1/H.26/s93606666143392?AQB=1&ndh=1&t=24%2F6%2F2013%2023%3A59%3A17%203%20300&fid=0DDFB0A0676D5241-080519A2C0D076F2&ce=UTF-8&ns=mlb&pageName=Major%20League%20Baseball%3A%20Multimedia%3A%20Video%20Playback%20Page&g=hxxp%3A%2F%2Fwapc.mlb.com%2Fdet%2Fplay%2F%3Fcontent_id%3D29083605%26topic_id%3D8878748%26c_id%3Ddet&cc=USD&events=event2%2Cevent28%2Cevent4&v13=Video%20Playback%20Page&c24=mlbglobal08%2Cmlbtigers&v28=28307515%7CFLASH_1200K_640X360&c49=mlb.mlb.com&v49=mlb.mlb.com&pe=lnk_o&pev1=hxxp%3A%2F%2FmyGenericURL&pev2=VPP%20Game%20Recaps&s=1440x900&c=32&j=1.6&v=Y&k=Y&bw=1440&bh=719&AQE=1"""|"image/gif"|"Sports and Recreation"|"66.235.133.11"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"hxxp://ad.auditude.com/adserver/e?type=podprogress&br=4&z=50389&u=e91d539c7acb7daed69ab3fcdb2a4ea0&pod=id%3A4%2Cctype%3Al%2Cptype%3At%2Cdur%3A200%2Clot%3A5%2Cedur%3A0%2Celot%3A0%2Ccpos%3A3&advancepattern=1&l=1374710168&cid=1922976207&event=complete&uid=RzsxnCYcRkiQ6p9YxyRdEQ&s=e9c06908&t=1374728168"|"-"|"Advertisements"|"63.140.50.240"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
The first snippet - the one that chews up resources but spits the data out the way I hoped - is:
$filename = 'Dump.csv'
$csv = Import-csv $filename -Delimiter '|'
$csv | foreach {
$url = $_.URL
$_.URL = $url -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'
}
$csv | Export-Csv 'DumpParsed.csv'
It spits out output like this:
"Begin Date","End Date","Time (GMT -05:00)","URL","CONTENT TYPE","URL CATEGORY","DESTINATION IP","Disposition","Policy Name","Policy Type","Application Type","User","User Type"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728377","hxxp://mediadownloads.mlb.com","video/mp4","Sports and Recreation","165.254.94.168","Allow","Generics","Access","Media","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728376","hxxp://stats.pandora.com","text/javascript","Streaming Audio","208.85.40.44","Allow","Generics","Access","Media","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728357","hxxp://b.scorecardresearch.com","image/gif","Business and Industry","207.152.125.91","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://lt150.tritondigital.com","text/plain","Business and Industry","208.92.52.90","Allow","Generics","Access","-","DOMAIN\GEN1@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://an.mlb.com","image/gif","Sports and Recreation","66.235.133.11","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://ad.auditude.com","-","Advertisements","63.140.50.240","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
The second snippet runs noticeably faster, but spits out malformed data that SQL doesn't like.
$filename = 'Dump.csv'
Import-csv $filename -Delimiter '|' | foreach {
$_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'
Add-Content 'DumpParsed.csv' "$_"
}
The output isn't pretty:
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728377; URL=hxxp://mediadownloads.mlb.com; CONTENT TYPE=video/mp4; URL CATEGORY=Sports and Recreation; DESTINATION IP=165.254.94.168; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=Media; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728357; URL=hxxp://b.scorecardresearch.com; CONTENT TYPE=image/gif; URL CATEGORY=Business and Industry; DESTINATION IP=207.152.125.91; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://lt150.tritondigital.com; CONTENT TYPE=text/plain; URL CATEGORY=Business and Industry; DESTINATION IP=208.92.52.90; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\GEN1@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://an.mlb.com; CONTENT TYPE=image/gif; URL CATEGORY=Sports and Recreation; DESTINATION IP=66.235.133.11; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://ad.auditude.com; CONTENT TYPE=-; URL CATEGORY=Advertisements; DESTINATION IP=63.140.50.240; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
Any other ideas? I know some PowerShell and a little SQL, but I'm open to anything else.
Answer 0 (score: 1)
Your second solution works faster because it doesn't load the whole file into memory. You can try changing it like this:
$filename = 'Dump.csv'
Import-csv $filename -Delimiter '|' | foreach { $_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'; $_ } |export-csv 'DumpParsed.csv'
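One caveat, assuming the default behavior of Export-Csv in Windows PowerShell: without -NoTypeInformation it writes a #TYPE header line at the top of the output, which a SQL import may well choke on. A slightly safer variant of the same pipeline would be:
$filename = 'Dump.csv'
Import-csv $filename -Delimiter '|' | foreach { $_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'; $_ } | Export-Csv 'DumpParsed.csv' -NoTypeInformation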
Answer 1 (score: 1)
First of all, when you do this:
$csv = Import-csv $filename -Delimiter '|'
you load the whole file into memory as objects built from its fields, so it's no surprise that memory consumption and performance are a problem. The second approach isn't too bad, but it should dump the data in CSV format; as it stands, it dumps the string representation of the objects it creates. You can try this:
$filename = 'Dump.csv'
Import-csv $filename -Delimiter '|' |
Foreach {$_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'; $_} |
ConvertTo-Csv -NoTypeInfo | Out-File DumpParsed.csv -Enc UTF8 -Append
By the way, it would be interesting to see whether skipping the CSV processing entirely speeds this up significantly or not:
Get-Content $filename | Foreach {$_ -replace '\"*(\w*)://([^/]*)/[^"]*"(.*)','$1://$2"$3'} |
Out-File DumpParsed.csv -Enc UTF8
I'm only guessing at the original encoding of the log file; it may well be ASCII.
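If plain Get-Content is still too slow on a file that size, batching the lines with -ReadCount sometimes helps noticeably. This is only a sketch reusing the regex above, with an arbitrary batch size:
# -ReadCount sends the lines down the pipeline in arrays of 1000 instead of one at a time
Get-Content $filename -ReadCount 1000 |
 Foreach {$_ -replace '\"*(\w*)://([^/]*)/[^"]*"(.*)','$1://$2"$3'} |
 Out-File DumpParsed.csv -Enc UTF8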
Answer 2 (score: 1)
Have you tried using a stream writer for your output, and instead of importing the file as CSV, just going through it line by line? Something like this:
$filename = "Dump.csv"
$out = "C:\path\to\out-file.csv" # full path required here
$stream = [System.IO.StreamWriter] $out
Get-Content $filename `
| % {
$line = $_ -replace '\"+(\w*)://([^/]*)/(.*?)\"+','"$1://$2"'
$stream.WriteLine($line)
}
$stream.close()
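Taking the same idea one step further - and this is only a sketch with assumed paths, not something I have timed against the real dump - you could stream the input side too with a StreamReader, so neither reading nor writing holds much in memory:
$reader = [System.IO.StreamReader] "C:\path\to\Dump.csv"    # full paths required for the .NET classes
$writer = [System.IO.StreamWriter] "C:\path\to\out-file.csv"
while (($line = $reader.ReadLine()) -ne $null) {
    # same regex as above: keep only scheme://host inside the quoted URL field
    $writer.WriteLine(($line -replace '\"+(\w*)://([^/]*)/(.*?)\"+','"$1://$2"'))
}
$reader.Close()
$writer.Close()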
If you're importing into SQL Server, you can set the TextQualified field to true and it will treat everything inside the quotes as one string, including the extra commas.
Answer 3 (score: 0)
If your database import chokes on the comma, wouldn't just replacing that comma be an option? Like this:
Get-Content 'Dump.csv' | % { $_ -replace ',','%2C' } | Out-File 'DumpParsed.csv'
Or like this (if the other fields contain literal commas you want to keep):
Import-Csv 'Dump.csv' -Delimiter '|' `
| % { $_.URL = $_.URL -replace ',','%2C'; $_ } `
| Export-Csv 'DumpParsed.csv' -Delimiter '|'
%2C is the URL encoding for a comma.