Cutting and grepping information from a CSV in Ruby

Time: 2013-08-10 11:42:44

Tags: ruby bash parsing

I have a huge .csv file with the following headers:

Timestamp,URL,IP

Embedded in the URL requests is the YouTube video ID, which is what I need to extract.

Input

"26 Jul 2013 00:01:01 UTC","http://r2---sn-nwj7km7e.c.youtube.com/videoplayback?algorithm=throttle-factor&burst=40&clen=255192903&cp=U0hWSVhMUV9GTUNONl9QRlVHOlBSTXhMQ2FtRVRy&cpn=lwn6qrn2_oDOCQl_&dur=4259.840&expire=1374813613&factor=1.25&fexp=900223%2C912307%2C911419%2C932217%2C914028%2C916624%2C919515%2C909546%2C929117%2C929121%2C929906%2C929907%2C925720%2C925722%2C925718%2C925714%2C929917%2C929919%2C912521%2C904830%2C919373%2C904122%2C919387%2C936303%2C909549%2C900816%2C936301%2C912711%2C935000&gcr=in&gir=yes&id=10ff11582e78027b&ip=132.93.92.117&ipbits=8&itag=134&keepalive=yes&key=yt1&lmt=1368924664324037&ms=au&mt=1374793074&mv=m&nh=EAI&range=143196160-144138239&ratebypass=yes&signature=78B2B03AFE619C43E61B30AC228B9C33990B2D89.CADEA7BA4F49AF7C0CB9D6A0C7E4EB277AA338F2&source=youtube&sparams=algorithm%2Cburst%2Cclen%2Ccp%2Cdur%2Cfactor%2Cgcr%2Cgir%2Cid%2Cip%2Cipbits%2Citag%2Clmt%2Csource%2Cupn%2Cexpire&sver=3&upn=S4gwbSmbOGM","192.168.101.2",
"26 Jul 2013 00:02:31 UTC","http://www.youtube.com/watch?v=3hSSRHJYHVY",192.168.101.6"
"26 Jul 2013 00:02:34 UTC","http://www.youtube.com/player_204?ei=lrzxUberMOq_kwLnsoGwDQ&plid=AATiXtvkD53nSs3J&fv=WIN%2011,6,602,180&l_ns=1&len=138&l_state=3&fmt=134&lact=1598&slots=sst~0;sidx~0;at~1_3&ad_flags=1&event=ad&cid=7317&el=detailpage&art=2.24&mt=0&fexp=933900,901439,924368,914070,916612,929305,909546,929117,929121,929906,929907,925720,925722,925718,925714,929917,929919,912521,904830,919373,904122,932216,908534,919387,936303,909549,900816,936301,912711,935000&sidx=0&scoville=1&ad_event=3&sst=0&allowed=1_2,1_2_1,1_1,1_3&v=3hSSRHJYHVY&ad_sys=GDFP&rt=1.002&ns=yt&cpn=-gf8Awba9stlT85b&at=1_3&ad_id=16345549","192.168.101.9"
"26 Jul 2013 00:09:02 UTC","http://www.youtube.com/watch?v=e3oP5NtjlEQ","192.168.101.7",

I can almost achieve this in bash, but I would like to do it in Ruby (still learning).

cut -d , -f 2 urls.csv | grep watch?v=

Output

"http://www.youtube.com/watch?v=chzEn7TmzJA"
"http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F"
"http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F"
"http://www.youtube.com/watch?v=ssdqClUH00c"
"http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4"

The YouTube video ID is basically the 11 characters after watch?v=, up to the first &.
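The rule just described can be sketched as a regular expression. This is only an illustration: the constant WATCH_ID and the helper watch_id are hypothetical names, and the character class assumes, based on the sample IDs above, that an ID consists of letters, digits, underscores and hyphens.

```ruby
# Assumption: a video ID is exactly 11 characters drawn from letters,
# digits, "_" and "-" (inferred from the sample URLs, not from a spec).
WATCH_ID = /watch\?v=([\w-]{11})/

# Hypothetical helper: returns the ID that follows "watch?v=",
# or nil when the URL does not contain that prefix.
def watch_id(url)
  url[WATCH_ID, 1]
end

puts watch_id("http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F")
```

Because the pattern requires the literal text watch?v=, a request that carries the ID in some other position (for example &v= inside a player_204 URL) yields nil.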

Thanks.

Update

require 'csv'
require 'addressable/uri'

#read lines from csv, headers on
lines = CSV.readlines("test.csv", :headers=>true)

#print csv columns with headers 'Date and Time' and 'Url'
#p lines ['Date and Time']
#p lines['Url']
#timestamp = lines ['Date and Time']
urls = lines['Url']

# for each line (url) query value
urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  if (v)
     puts v # prints value if found
  end
end

The code above prints the video IDs found in every request, not only in the watch?v= requests, so there are many duplicates.

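How can I make it print only the videos whose URL has the watch?v= prefix (together with the timestamp and IP)? That prefix indicates the video was actually played. Thanks.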

2 Answers:

Answer 0: (score: 1)

Support for slicing and dicing URIs is limited in Ruby's core URI class. Another option is addressable/uri.

require 'addressable/uri'
uri=Addressable::URI.parse('http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4')
uri.query_values["v"] #query_values returns key-value pairs of query components
=> "nLIH9cA-Ftg"

Here is a snippet

urls=["http://www.youtube.com/watch?v=chzEn7TmzJA", "http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F", "http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F", "http://www.youtube.com/watch?v=ssdqClUH00c", "http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4"]

urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  puts v
end

Returns

chzEn7TmzJA
wAVl_IJV5eI
8t2s9HSrkl8
ssdqClUH00c
nLIH9cA-Ftg

You can get addressable/uri with

sudo gem install addressable
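If installing a gem is not an option, roughly the same lookup can be sketched with the standard library alone, and it can also be restricted to watch pages. This swaps Addressable for Ruby's own URI module; the helper name played_video_id is hypothetical, and the player_204 URL in the usage lines is a shortened form of the one in the question.

```ruby
require 'uri'

# Hypothetical helper: returns the "v" parameter only when the URL's
# path is exactly "/watch"; returns nil for every other request.
def played_video_id(url)
  uri = URI.parse(url)
  return nil unless uri.path == "/watch" && uri.query

  # decode_www_form yields [key, value] pairs; to_h makes them a Hash.
  URI.decode_www_form(uri.query).to_h["v"]
rescue URI::InvalidURIError
  nil # treat unparseable URLs as "not a played video"
end

p played_video_id("http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F")
p played_video_id("http://www.youtube.com/player_204?event=ad&v=3hSSRHJYHVY")
```

Checking the path instead of only looking for a v parameter is what filters out the player_204 requests that carry the same ID.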

Answer 1: (score: 0)

In Ruby on Rails:

You can try this:

 require 'csv'
 lines = CSV.readlines("path to csv file")

Then you can iterate over these rows:

require 'uri'
require 'cgi'

lines.each do |row|
 url_parameters = row[n]  # where n should be the zero-based position of the URL column in the csv
 uri = URI.parse(url_parameters)
 next if uri.query.nil?   # skip URLs that have no query string
 uri_params = CGI.parse(uri.query)
 video_code = uri_params['v'].first

 # this is the video code of the youtube url (nil if there is no v parameter): You can do whatever is the requirement

end
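Putting the pieces together, here is a minimal sketch of what the question asks for: the video ID together with its timestamp and IP, for watch requests only. It makes several assumptions: the header names Timestamp, URL and IP, the helper name played_rows, and an inline SAMPLE string that stands in for the real file. The sample rows are taken from the question, with the player_204 URL shortened, trailing commas dropped, and the quoting of the second row's IP repaired so the text is well-formed CSV.

```ruby
require 'csv'
require 'uri'

# Inline stand-in for the real file (header names are assumed).
SAMPLE = <<~ROWS
  Timestamp,URL,IP
  "26 Jul 2013 00:02:31 UTC","http://www.youtube.com/watch?v=3hSSRHJYHVY","192.168.101.6"
  "26 Jul 2013 00:02:34 UTC","http://www.youtube.com/player_204?event=ad&v=3hSSRHJYHVY","192.168.101.9"
  "26 Jul 2013 00:09:02 UTC","http://www.youtube.com/watch?v=e3oP5NtjlEQ","192.168.101.7"
ROWS

# Hypothetical helper: returns [timestamp, video_id, ip] triples for
# rows whose URL path is "/watch" and whose query has a "v" parameter.
def played_rows(csv_text)
  CSV.parse(csv_text, headers: true).each_with_object([]) do |row, out|
    uri = URI.parse(row["URL"])
    next unless uri.path == "/watch" && uri.query

    id = URI.decode_www_form(uri.query).to_h["v"]
    out << [row["Timestamp"], id, row["IP"]] if id
  end
end

played_rows(SAMPLE).each { |fields| puts fields.join(",") }
```

For a genuinely huge file, the same loop body could be driven by CSV.foreach(path, headers: true), which reads one row at a time instead of loading everything into memory.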