I have a huge .csv file with the following headers:
Timestamp,URL,IP
Embedded in the requested URLs are the YouTube video IDs that I need to extract.
Input
"26 Jul 2013 00:01:01 UTC","http://r2---sn-nwj7km7e.c.youtube.com/videoplayback?algorithm=throttle-factor&burst=40&clen=255192903&cp=U0hWSVhMUV9GTUNONl9QRlVHOlBSTXhMQ2FtRVRy&cpn=lwn6qrn2_oDOCQl_&dur=4259.840&expire=1374813613&factor=1.25&fexp=900223%2C912307%2C911419%2C932217%2C914028%2C916624%2C919515%2C909546%2C929117%2C929121%2C929906%2C929907%2C925720%2C925722%2C925718%2C925714%2C929917%2C929919%2C912521%2C904830%2C919373%2C904122%2C919387%2C936303%2C909549%2C900816%2C936301%2C912711%2C935000&gcr=in&gir=yes&id=10ff11582e78027b&ip=132.93.92.117&ipbits=8&itag=134&keepalive=yes&key=yt1&lmt=1368924664324037&ms=au&mt=1374793074&mv=m&nh=EAI&range=143196160-144138239&ratebypass=yes&signature=78B2B03AFE619C43E61B30AC228B9C33990B2D89.CADEA7BA4F49AF7C0CB9D6A0C7E4EB277AA338F2&source=youtube&sparams=algorithm%2Cburst%2Cclen%2Ccp%2Cdur%2Cfactor%2Cgcr%2Cgir%2Cid%2Cip%2Cipbits%2Citag%2Clmt%2Csource%2Cupn%2Cexpire&sver=3&upn=S4gwbSmbOGM","192.168.101.2",
"26 Jul 2013 00:02:31 UTC","http://www.youtube.com/watch?v=3hSSRHJYHVY",192.168.101.6"
"26 Jul 2013 00:02:34 UTC","http://www.youtube.com/player_204?ei=lrzxUberMOq_kwLnsoGwDQ&plid=AATiXtvkD53nSs3J&fv=WIN%2011,6,602,180&l_ns=1&len=138&l_state=3&fmt=134&lact=1598&slots=sst~0;sidx~0;at~1_3&ad_flags=1&event=ad&cid=7317&el=detailpage&art=2.24&mt=0&fexp=933900,901439,924368,914070,916612,929305,909546,929117,929121,929906,929907,925720,925722,925718,925714,929917,929919,912521,904830,919373,904122,932216,908534,919387,936303,909549,900816,936301,912711,935000&sidx=0&scoville=1&ad_event=3&sst=0&allowed=1_2,1_2_1,1_1,1_3&v=3hSSRHJYHVY&ad_sys=GDFP&rt=1.002&ns=yt&cpn=-gf8Awba9stlT85b&at=1_3&ad_id=16345549","192.168.101.9"
"26 Jul 2013 00:09:02 UTC","http://www.youtube.com/watch?v=e3oP5NtjlEQ","192.168.101.7",
I can almost do this in bash, but I would like to do it in Ruby (I am still learning).
cut -d , -f 2 urls.csv | grep watch?v=
Output
"http://www.youtube.com/watch?v=chzEn7TmzJA"
"http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F"
"http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F"
"http://www.youtube.com/watch?v=ssdqClUH00c"
"http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4"
The YouTube video ID is basically the 11 characters after watch?v= up to the first &.
Thanks.
Update
require 'csv'
require 'addressable/uri'
#read lines from csv, headers on
lines = CSV.readlines("test.csv", :headers=>true)
#print csv column with headers 'Date and Time and 'Url'
#p lines ['Date and Time']
#p lines['Url']
#timestamp = lines ['Date and Time']
urls = lines['Url']
# for each line (url), look up the "v" query value
urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  if v
    puts v # prints value if found
  end
end
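The code above outputs the video ID from every request that contains one, not only the requests with watch?v=, so there are many duplicates.
How can I make it output only the videos whose URL contains watch?v= (together with the timestamp and IP)? Those are the requests that indicate the video was actually played. Thanks.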
Answer 0 (score: 1)
Support for slicing and dicing URIs is limited in Ruby's core uri class. Another option is addressable/uri.
require 'addressable/uri'
uri=Addressable::URI.parse('http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4')
uri.query_values["v"] #query_values returns key-value pairs of query components
=> "nLIH9cA-Ftg"
Here is a snippet:
urls=["http://www.youtube.com/watch?v=chzEn7TmzJA", "http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F", "http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F", "http://www.youtube.com/watch?v=ssdqClUH00c", "http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4"]
urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  puts v
end
This returns:
chzEn7TmzJA
wAVl_IJV5eI
8t2s9HSrkl8
ssdqClUH00c
nLIH9cA-Ftg
You can install addressable/uri with:
sudo gem install addressable
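For the follow-up in the question (only requests that are actual watch?v= page views), one option is to check the URL path as well as the "v" parameter. The following is a minimal sketch, not a definitive implementation: it uses the standard library's URI instead of Addressable so that it runs without extra gems, and the helper name watched_video_id is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: returns the video ID only when the URL path is
# exactly "/watch"; returns nil for every other request.
def watched_video_id(url)
  uri = URI.parse(url)
  return nil unless uri.path == '/watch' && uri.query

  # decode_www_form returns [key, value] pairs; take the first "v" pair.
  pair = URI.decode_www_form(uri.query).assoc('v')
  pair && pair.last
rescue URI::InvalidURIError
  nil # skip URLs that the standard-library parser rejects
end

puts watched_video_id('http://www.youtube.com/watch?v=3hSSRHJYHVY')
puts watched_video_id('http://www.youtube.com/player_204?event=ad&v=3hSSRHJYHVY').inspect
```

The player_204 request also carries a "v" parameter, which is why checking the parameter alone produces duplicates. With Addressable the same test would be a comparison of Addressable::URI.parse(url).path against "/watch".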
Answer 1 (score: 0)
In Ruby on Rails, you can try this:
require 'csv'
require 'uri'
require 'cgi'
lines = CSV.readlines("path to csv file")
Then you can iterate over those rows:
lines.each do |row|
  url = row[n] # where n is the position of the URL column in the CSV
  uri = URI.parse(url)
  next unless uri.query # skip URLs that have no query string
  uri_params = CGI.parse(uri.query)
  video_code = uri_params['v'].first
  # video_code is the YouTube video ID; use it however your requirements dictate
end
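To keep the timestamp and IP next to each ID, the rows can be read with headers and filtered with the rule from the question (the characters after watch?v= up to the first &). This is a rough sketch under stated assumptions: the header names Timestamp, URL and IP are assumed, the sample rows are cleaned-up versions of the question's input, and played_videos is a hypothetical helper.

```ruby
require 'csv'

# Hypothetical helper: returns [timestamp, video_id, ip] for every row
# whose URL contains "/watch?v=".
def played_videos(csv_text)
  # Captures everything after "/watch?v=" up to the next "&" or end of string.
  watch_id = %r{/watch\?v=([^&]+)}

  CSV.parse(csv_text, headers: true).each_with_object([]) do |row, found|
    id = row['URL'].to_s[watch_id, 1]
    found << [row['Timestamp'], id, row['IP']] if id
  end
end

# Sample data in the question's layout (header names are assumed).
sample = <<~DATA
  Timestamp,URL,IP
  "26 Jul 2013 00:02:31 UTC","http://www.youtube.com/watch?v=3hSSRHJYHVY","192.168.101.6"
  "26 Jul 2013 00:02:34 UTC","http://www.youtube.com/player_204?event=ad&v=3hSSRHJYHVY","192.168.101.9"
  "26 Jul 2013 00:09:02 UTC","http://www.youtube.com/watch?v=e3oP5NtjlEQ&list=PL34B86ECEC1703D6F","192.168.101.7"
DATA

played_videos(sample).each { |fields| puts fields.join(',') }
```

The pattern only matches when v is the first query parameter, which is the case in all the watch URLs shown in the question. For a very large file, CSV.foreach(path, headers: true) yields one row at a time instead of loading the whole file into memory.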