A good way to scrape usernames from a web page

Time: 2012-12-20 11:10:17

Tags: web scrape

I want to scrape usernames from YouTube comments, for example from this page:

http://www.youtube.com/all_comments?v=mIA0W69U2_Y

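I want to get all the usernames/display names, such as "fedfields" and "mystik dread", together with the corresponding links (clicking "fedfields" takes you to that user's profile). I want to scrape them with an automated bash script. I have the following questions:

1. My original approach was to write an automated script that downloads each page with wget and then uses regular expressions to extract the names. That way, however, I have to download the entire page; each page is several MB, so downloading many pages takes a lot of space. Is there a better way?

2. The comments are spread over many pages; the link above, for example, has 7 pages. Is it possible to get them all on a single page?

6 answers: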

Answer 0 (score: 2)

You can use HtmlAgilityPack in a C# application.

        HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
        IEnumerable<HtmlNode> userNames = doc.DocumentNode.Descendants("a").Where(
            d => d.Attributes.Contains("class") &&   
            d.Attributes["class"].Value.Contains("yt-user-name"));
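For readers who are not using C#, the same selection (every `a` element whose class contains `yt-user-name`) can be sketched in Python with only the standard library. This is an illustration under assumptions, not part of the original answer: the names `UserLinkParser` and `extract_users` are made up here, and the base URL used to resolve relative profile links is assumed to be `http://www.youtube.com`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class UserLinkParser(HTMLParser):
    """Collect (display name, absolute href) for anchors with a given class fragment."""

    def __init__(self, base_url, class_fragment="yt-user-name"):
        super().__init__()
        self.base_url = base_url
        self.class_fragment = class_fragment
        self.users = []      # collected (name, absolute href) pairs
        self._href = None    # href of the matching anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.class_fragment in (attrs.get("class") or ""):
            self._href = urljoin(self.base_url, attrs.get("href") or "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.users.append((name, self._href))
            self._href = None


def extract_users(html, base_url="http://www.youtube.com"):
    """Return all (name, profile URL) pairs found in an HTML string."""
    parser = UserLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.users
```

Because the parser works on a string held in memory, nothing has to be written to disk, which addresses the space concern in the question.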

Useful info about parsing html with RegEx

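I don't know whether YouTube serves its content with native gzip compression, but you can check with the WebRequest class. If it does, that will reduce traffic significantly.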

webRequest = (HttpWebRequest)WebRequest.Create(url);
webRequest.Method = WebRequestMethods.Http.Get;
webRequest.KeepAlive = true;
webRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
webRequest.Headers.Add("Accept-Encoding", "gzip,deflate");
HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse(); 
MessageBox.Show(webResponse.ContentEncoding.ToString());

You can then read the stream with HtmlAgilityPack and get the usernames.
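The compression idea can also be sketched in Python. The snippet below is a minimal illustration, assuming the server honours the `Accept-Encoding` request header; the function names `decode_body` and `fetch` are hypothetical and not taken from the answer. The page is kept in memory and decompressed there, so no file is written.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw, content_encoding):
    """Undo the Content-Encoding reported by the server, if any."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw  # not compressed, or an encoding this sketch does not handle


def fetch(url):
    """Download a page into memory while asking for a compressed transfer."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(request) as response:
        encoding = response.headers.get("Content-Encoding", "")
        return decode_body(response.read(), encoding)
```

Whether a given server actually compresses its responses can be seen from the `Content-Encoding` response header, which is the same check the C# snippet performs.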

Answer 1 (score: 2)

Use ScrapeGoat on Mashape to get all the usernames back as a JSON object :)

https://www.mashape.com/warting/scrapegoat/

curl --include --request GET 'https://scrapegoat.p.mashape.com/?url=http%3A%2F%2Fwww.youtube.com%2Fall_comments%3Fv%3DmIA0W69U2_Y&selector=.yt-user-name' --header "X-Mashape-Authorization: <MASHAPE API KEY>"
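The long query string in the curl command is just the target page URL and the CSS selector, percent-encoded. A small Python sketch of how such a request URL can be built follows; the endpoint is copied from the command above, the helper name `build_request_url` is hypothetical, and the snippet only builds the URL without contacting the service.

```python
from urllib.parse import urlencode

# Endpoint copied from the curl command; its availability is not checked here.
ENDPOINT = "https://scrapegoat.p.mashape.com/"


def build_request_url(page_url, selector):
    """Percent-encode the target page and CSS selector into the query string."""
    return ENDPOINT + "?" + urlencode({"url": page_url, "selector": selector})
```

Calling `build_request_url("http://www.youtube.com/all_comments?v=mIA0W69U2_Y", ".yt-user-name")` reproduces the URL passed to curl.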

Result:

{"message":"ok","payload":["whitehouse","Osambasucks2","Osambasucks2","Osambasucks2","omar barazanji","omar barazanji","omar barazanji","omar barazanji","omar barazanji","omar barazanji","HigherPlanes","HigherPlanes","HigherPlanes","RamonaFromPomona","RamonaFromPomona","Osambasucks2","Osambasucks2","Osambasucks2","RamonaFromPomona","terminator360tm","terminator360tm","terminator360tm","terminator360tm","terminator360tm","terminator360tm","Osambasucks2","Osambasucks2","Osambasucks2","Joe Lackey","Joe Lackey","Joe Lackey","ThaGenius101","ThaGenius101","ThaGenius101","Joe Lackey","Ed Patowski","Ed Patowski","Ed Patowski","toughdogyt","toughdogyt","toughdogyt","Osambasucks2","Osambasucks2","Osambasucks2","goodkarmaband","goodkarmaband","Martynas Valiukas","Martynas Valiukas","Martynas Valiukas","goodkarmaband","goodkarmaband","goodkarmaband","Martynas Valiukas","XRedstone688X","XRedstone688X","XRedstone688X","goodkarmaband","Trevor Jones","Trevor Jones","Trevor Jones","goodkarmaband","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","leeman6417","leeman6417","leeman6417","Osambasucks2","Osambasucks2","Osambasucks2","leeman6417","sosocrazy1234","sosocrazy1234","sosocrazy1234","leeman6417","liamdudeeee","liamdudeeee","liamdudeeee","sosocrazy1234","sosocrazy1234","sosocrazy1234","sosocrazy1234","leeman6417","Ed Patowski","Ed Patowski","Ed Patowski","mastershakelock","mastershakelock","mastershakelock","VGQgex","VGQgex","VGQgex","Osambasucks2","Osambasucks2","Osambasucks2","VGQgex","MindzEnt","MindzEnt","MindzEnt","William willie","William willie","William willie","William willie","William willie","William willie","bkdmd","bkdmd","bkdmd","Osambasucks2","Osambasucks2","Osambasucks2","bkdmd","Rafael Vargas","Rafael Vargas","Rafael Vargas","7even2wenty1","7even2wenty1","7even2wenty1","cashlessbread","cashlessbread","cashlessbread","base3798","base3798","base3798","Ed Patowski","Ed Patowski","Ed Patowski","base3798","john smith","john smith","john 
smith","Ed Patowski","Neftali Acosta","Neftali Acosta","Neftali Acosta","Ed Patowski","Ed Patowski","Ed Patowski","Neftali Acosta","john smith","john smith","john smith","Neftali Acosta","Canal YooCheckTheFloow","Canal YooCheckTheFloow","Canal YooCheckTheFloow","Abandonbeast","Abandonbeast","Abandonbeast","Canal YooCheckTheFloow","Ironcitytony72","Ironcitytony72","Ironcitytony72","john smith","john smith","john smith","Ironcitytony72","Andrew Apelt","Andrew Apelt","Andrew Apelt","Ironcitytony72","Osambasucks2","Osambasucks2","Osambasucks2","Andrew Apelt","Andrew Apelt","Andrew Apelt","Andrew Apelt","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Andrew Apelt","incas94","incas94","incas94","Osambasucks2","William willie","William willie","William willie","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Andrew Apelt","Andrew Apelt","Osambasucks2","LawnMowerfromHell","LawnMowerfromHell","LawnMowerfromHell","Ironcitytony72","Osambasucks2","Osambasucks2","Osambasucks2","TheAndr3tzi","TheAndr3tzi","TheAndr3tzi","thumsupformyusername","thumsupformyusername","thumsupformyusername","algett","algett","algett","thumsupformyusername","thumsupformyusername","thumsupformyusername","thumsupformyusername","algett","ferkondenster","ferkondenster","ferkondenster","Christian Heinrich","Christian Heinrich","Christian Heinrich","erieejustice911","erieejustice911","erieejustice911","ferkondenster","ferkondenster","ferkondenster","Seth Farsides","Seth Farsides","Seth Farsides","ferkondenster","ferkondenster","ferkondenster","Seth Farsides","Seth Farsides","Seth Farsides","ferkondenster","Doky9889","Doky9889","Doky9889","ferkondenster","ferkondenster","ferkondenster","ferkondenster","Doky9889","sealrk19","sealrk19","sealrk19","wiljam12345","wiljam12345","wiljam12345","Dwayne Cole","Dwayne Cole","Dwayne 
Cole","Osambasucks2","Osambasucks2","Osambasucks2","Dwayne Cole","Jax Jr","Jax Jr","Jax Jr","Rafael Vargas","Rafael Vargas","Rafael Vargas","William willie","William willie","William willie","William willie","William willie","William willie","Gunnar Rowe","Gunnar Rowe","Gunnar Rowe","Rafael Vargas","Rafael Vargas","Rafael Vargas","Susan Porter","Susan Porter","Susan Porter","derp toth","derp toth","derp toth","MXNR16","nick62301","nick62301","nick62301","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","SeventhSun","SeventhSun","SeventhSun","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Rafael Vargas","Rafael Vargas","Rafael Vargas","senormierda","senormierda","senormierda","Rafael Vargas","chrisgilofficial","chrisgilofficial","chrisgilofficial","MXNR16","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","chrisgilofficial","chrisgilofficial","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","chrisgilofficial","chrisgilofficial","chrisgilofficial","Osambasucks2","Andrew Apelt","Andrew Apelt","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","aztecadog","aztecadog","aztecadog","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","ThePhase20","ThePhase20","ThePhase20","ICE778","ICE778","ICE778","Sabrina Blacks","Sabrina Blacks","Sabrina Blacks","Darwin Gutierrez","Darwin Gutierrez","Darwin 
Gutierrez","lessonsfromryan","tooncrazy1","tooncrazy1","tooncrazy1","unbreackable3000","unbreackable3000","unbreackable3000","Barack Obama","Barack Obama","Barack Obama","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","tooncrazy1","tooncrazy1","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","Americaunderduress","Americaunderduress","Americaunderduress","Barack Obama","Barack Obama","Barack Obama","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","FoodStampBarry","FoodStampBarry","FoodStampBarry","Barack Obama","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","myviewsontheworld","myviewsontheworld","myviewsontheworld","SuperNikoYT","SuperNikoYT","SuperNikoYT","myviewsontheworld","Osambasucks2","Osambasucks2","Osambasucks2","myviewsontheworld","Americaunderduress","Americaunderduress","Americaunderduress","myviewsontheworld","Asuma741","Asuma741","Asuma741","RevolutionNewz","damonjo15","damonjo15","damonjo15","Osambasucks2","Osambasucks2","Osambasucks2","damonjo15","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2"
,"Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","tooncrazy1","tooncrazy1","tooncrazy1","Aries2012100","KH AK","KH AK","KH AK","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","kangaroo3259","kangaroo3259","kangaroo3259","Aries2012100","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","youhan younen","youhan younen","youhan younen","tooncrazy1","tooncrazy1","tooncrazy1","youhan younen","Osambasucks2","Osambasucks2","Osambasucks2","youhan younen","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Osambasucks2","Osambasucks2","Osambasucks2","Stevejobsultimate2","Rafael Vargas","Rafael Vargas","Rafael Vargas","drewpert0515","drewpert0515","drewpert0515","dv wfwefwe","TheAlienContactee","TheAlienContactee","TheAlienContactee","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Jordan Beckwith","Jordan Beckwith","Jordan Beckwith","Michael Carrillo","Michael Carrillo","Michael Carrillo","gotwess","gotwess","gotwess","gotwess","Michael Carrillo","Michael Carrillo","Michael Carrillo","Michael Carrillo","gotwess","Jawad Pullin","Jawad Pullin","Jawad Pullin","TreborHG93","tooncrazy1","tooncrazy1","tooncrazy1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","kinggrindhard","kinggrindhard","kinggrindhard","branoaas branoaas","branoaas branoaas","branoaas branoaas","Osambasucks2","Osambasucks2","Osambasucks2","branoaas branoaas","branoaas branoaas","branoaas branoaas","branoaas 
branoaas","Theindicud","Theindicud","Theindicud","eizieizz","eizieizz","eizieizz","Osambasucks2","Osambasucks2","Osambasucks2","eizieizz","1990Zuck","1990Zuck","1990Zuck","ArcoZakus","ArcoZakus","ArcoZakus","firemedic30ca","johnny grove","johnny grove","johnny grove","joost1v","joost1v","joost1v","Osambasucks2","Osambasucks2","Osambasucks2","joost1v","5sdk1","5sdk1","5sdk1","jeff brennan","jeff brennan","jeff brennan","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","jeff brennan","jeff brennan","jeff brennan","jeff brennan","Bo James","aztecadog","aztecadog","aztecadog","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","Paul Pascalau","Paul Pascalau","Paul Pascalau","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg 
Cimera","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","aztecadog","aztecadog","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","aztecadog","aztecadog","aztecadog","Osambasucks2","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Ed Patowski","Ed Patowski","Ed Patowski","Zajac Staszek","aztecadog","aztecadog","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","gotwess","gotwess","gotwess","aztecadog","JeremyTheMoose","JeremyTheMoose","JeremyTheMoose","5sdk1","5sdk1","5sdk1","fordbronco1991","fordbronco1991","fordbronco1991","andy kerver","andy kerver","andy kerver","Omarimage","Omarimage","Omarimage","Omarimage","Omarimage","Omarimage","justin lionti","justin lionti","justin lionti","Omarimage","Butheadbros2","Butheadbros2","Butheadbros2","Omarimage","moonbeamrider1","moonbeamrider1","moonbeamrider1","justin lionti","justin lionti","justin lionti","moonbeamrider1","moonbeamrider1","moonbeamrider1","moonbeamrider1","justin lionti","fordbronco1991","fordbronco1991","fordbronco1991","pellenyberg","pellenyberg","pellenyberg","Son Goku","Son Goku","Son 
Goku","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","fisch kopf","fisch kopf","fisch kopf","andrew baker","andrew baker","andrew baker","FVCKDA POPO","FVCKDA POPO","FVCKDA POPO","MrChessmans","MrChessmans","MrChessmans","BryndisiDali","Brazzer man","Brazzer man","Brazzer man","Jack Thompson","ecw141685","ecw141685","ecw141685","Osambasucks2","Osambasucks2","Osambasucks2","ecw141685","lps24evelyn","lps24evelyn","lps24evelyn","erieejustice911","erieejustice911","erieejustice911","erieejustice911","erieejustice911","erieejustice911","Keepskatin","Keepskatin","Keepskatin","erieejustice911","V V","V V","V V","Keepskatin","Abrahan Peraza","Abrahan Peraza","Abrahan Peraza","lexyloveful","Zratedguns","Zratedguns","Zratedguns","MadNoys1","MadNoys1","MadNoys1","MadNoys1","Zratedguns","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","Joseph Pal","Joseph Pal","Joseph Pal","Joseph Pal","MadNoys1","MadNoys1","MadNoys1","MadNoys1","bear cat","laurynas stirbys","laurynas stirbys","laurynas stirbys","newjerusalem newtestament","newjerusalem newtestament","newjerusalem newtestament","amerilstones","amerilstones","amerilstones","newjerusalem newtestament","Keepskatin","Keepskatin","Keepskatin","newjerusalem newtestament","amerilstones","amerilstones","amerilstones","Keepskatin","Noah Neo","Noah Neo","Noah Neo","charmander4533","charmander4533","charmander4533","Noah Neo","Noah Neo","Noah Neo","Noah Neo","charmander4533","Noah Neo","Noah Neo","Noah 
Neo","charmander4533","Osambasucks2","Osambasucks2","Osambasucks2","Noah Neo","George Washington","George Washington","George Washington","charmander4533","izizdropshotz","izizdropshotz","izizdropshotz","charmander4533","Wavanova","Wavanova","Wavanova","charmander4533","wisestfoolalive","wisestfoolalive","wisestfoolalive","Noah Neo","Noah Neo","Noah Neo","Noah Neo","wisestfoolalive","colin dooley","colin dooley","colin dooley","colin dooley","colin dooley","colin dooley","Silme037","Silme037","Silme037","colin dooley","Keepskatin","Keepskatin","Keepskatin","colin dooley","princelord55","princelord55","princelord55","Osambasucks2","Osambasucks2","Osambasucks2","princelord55","DriadonRapShow","DriadonRapShow","DriadonRapShow","eddrum100","eddrum100","eddrum100","Ryan S","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Ryan S","Ryan S","Ryan S","eddrum100","eddrum100","eddrum100","Ryan S","Ryan S","Ryan S","Ryan S","Ryan S","eddrum100","eddrum100","eddrum100","eddrum100","RatedMForModz","RatedMForModz","RatedMForModz","alban97","alban97","alban97","RatedMForModz","Alex Bannon","Alex Bannon","Alex Bannon","alban97","alban97","alban97","alban97","Alex Bannon","james aaron","james aaron","james aaron","RatedMForModz","Ryan S","Ryan S","Ryan S","Dylan N","killllshot","killllshot","killllshot","Saadia Khan","Saadia Khan","talithatf17","talithatf17","talithatf17","amerilstones","amerilstones","amerilstones","talithatf17","BENGHAZIneverForget","BENGHAZIneverForget","BENGHAZIneverForget","talithatf17","talithatf17","talithatf17","supergrover6868","supergrover6868","supergrover6868","talithatf17","Alexander Sigsworth","Alexander Sigsworth","Alexander 
Sigsworth","supergrover6868","Zratedguns","Zratedguns","Zratedguns","supergrover6868","Keepskatin","Keepskatin","Keepskatin","Zratedguns","Butheadbros2","Butheadbros2","Butheadbros2","Zratedguns","Omegeist","Omegeist","Omegeist","supergrover6868","2Dmensions","2Dmensions","2Dmensions","talithatf17","talithatf17","talithatf17","supergrover6868","supergrover6868","supergrover6868","talithatf17","newjerusalem newtestament","newjerusalem newtestament","newjerusalem newtestament","supergrover6868","VGQgex","VGQgex","VGQgex","talithatf17","talithatf17","talithatf17","talithatf17","Mandragara","Mandragara","Mandragara","talithatf17","deathzbo","deathzbo","deathzbo","Mandragara","Mandragara","Mandragara","deathzbo","Mandragara","Mandragara","Mandragara","deathzbo","deathzbo","deathzbo","deathzbo","Mandragara","eddrum100","eddrum100","eddrum100","Mandragara","Mandragara","Mandragara","Mandragara","eddrum100","Unit01232","Unit01232","Unit01232","supergrover6868","supergrover6868","supergrover6868","Unit01232","Osambasucks2","Osambasucks2","Osambasucks2","supergrover6868","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Unit01232","Unit01232","Unit01232","Unit01232","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Unit01232","eddrum100","eddrum100","eddrum100","senormierda","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Kevin Koala","Kevin Koala","Kevin Koala","senormierda","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","GGRSC","GGRSC","GGRSC","GGRSC","eddrum100","michael smith","michael smith","michael 
smith","GGRSC","GGRSC","GGRSC","truthinvideos","supergrover6868","supergrover6868","supergrover6868","GGRSC","supergrover6868","supergrover6868","supergrover6868","eddrum100","eddrum100","eddrum100","supergrover6868","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","supergrover6868","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","supergrover6868","supergrover6868","supergrover6868","bobothecreepyclown","eddrum100","eddrum100","eddrum100","supergrover6868","supergrover6868","supergrover6868","supergrover6868","eddrum100","eddrum100","eddrum100","eddrum100","supergrover6868","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","willypdyer","willypdyer","willypdyer","Osambasucks2","Osambasucks2","Osambasucks2","willypdyer","spairtain","spairtain","sp
airtain","DigitalAcceptance","DigitalAcceptance","DigitalAcceptance","ElRancholo2","Osambasucks2","Osambasucks2","Osambasucks2","DigitalAcceptance","ElRancholo2","ElRancholo2","ElRancholo2","DigitalAcceptance","Osambasucks2","Osambasucks2","Osambasucks2","ElRancholo2","Mark Tse","Mark Tse","Mark Tse","DigitalAcceptance","Mark Tse","Mark Tse","Mark Tse","Mark Tse","The Best","The Best","The Best","supergrover6868","supergrover6868","supergrover6868","creativeengineer","creativeengineer","creativeengineer","eddrum100","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","creativeengineer","creativeengineer","creativeengineer","Ed Patowski","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","eddrum100","eddrum100","eddrum100","creativeengineer","creativeengineer","creativeengineer","creativeengineer","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","creativeengineer","supergrover6868","supergrover6868","supergrover6868","creativeengineer","creativeengineer","creativeengineer","creativeengineer","supergrover6868","supergrover6868","supergrover6868","creativeengineer","comicozy87","comicozy87","comicozy87","Raven Gomez","turbidhat","turbidhat","turbidhat","Daracon1010","Daracon1010","Daracon1010","Daracon1010","turbidhat","turbidhat","turbidhat","Daracon1010","VGQgex","VGQgex","VGQgex","Daracon1010","Daracon1010","Daracon1010","Daracon1010","VGQgex","WeThePeopleNoNWO","WeThePeopleNoNWO","WeThePeopleNoNWO","amerilstones","zmanthecool","zmanthecool","zmanthecool","metal220","supergrover6868","supergrover6868","supergrover6868","1974wolfman","1974wolfman","1974wolfman","William willie","William willie","William willie","1974wolfman","1974wolfman","1974wolfman","1974wolfman","William willie","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Kanwar Judge","Kanwar Judge","Kanwar 
Judge","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","abu bakr","abu bakr","abu bakr","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","Obamalies100","amerilstones","amerilstones","amerilstones","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","amerilstones","amerilstones","amerilstones","amerilstones","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","eddrum100","ThaYayo","ThaYayo","ThaYayo","William willie","chrisn365","chrisn365","chrisn365","Eli Jackson","Eli Jackson","Eli Jackson","Jboulos12","Frank Adams","Frank Adams","Frank 
Adams","amerilstones","amerilstones","amerilstones","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","eddrum100","eddrum100","eddrum100","amerilstones","amerilstones","amerilstones","amerilstones","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","supergrover6868","supergrover6868","supergrover6868","amerilstones","amerilstones","amerilstones","amerilstones","supergrover6868","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","LiamborninDC","LiamborninDC","LiamborninDC","Osambasucks2","Osambasucks2","
Osambasucks2","Osambasucks2","LiamborninDC","Osambasucks2","Osambasucks2","Osambasucks2","William willie","Osambasucks2","Osambasucks2","Osambasucks2","killllshot","killllshot","killllshot","killllshot","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","killllshot","killllshot","killllshot","killllshot","Osambasucks2","supergrover6868","supergrover6868","supergrover6868","killllshot","Osambasucks2","Osambasucks2","Osambasucks2","killllshot"],"status":200}
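The payload lists a name once for every matching element, so most names are repeated many times. A minimal Python sketch for reducing such a response to unique names, keeping the order of first appearance, is shown below. It assumes the response has the `payload` key seen above; the function name `unique_names` is made up for this example.

```python
import json


def unique_names(response_text):
    """Parse the JSON response and drop repeated names, keeping first-seen order."""
    payload = json.loads(response_text)["payload"]
    # dict preserves insertion order, so this removes duplicates without sorting
    return list(dict.fromkeys(payload))
```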

Answer 2 (score: 0)

Do this (the script is written for Python 2):

import re
import sys
import time
import urllib2

html = True

argv_list = sys.argv
if len(argv_list) == 2:
    vid = argv_list[1]
else:
    vid = "mIA0W69U2_Y"

regex = re.compile("<span class=\"author.*?<a href=\"(.*?)\".*? dir=\"ltr\">(.*?)</a>", re.DOTALL | re.UNICODE | re.IGNORECASE)

index = 1
author_lists = []
t1 = time.time()
print "######################### Start #########################"

while 1:
    url = "http://www.youtube.com/watch_ajax?action_get_comments=1&v="+vid+"&commenttype=everything&source=w&page_size=500&p="+str(index)+"&format=XML"
    print "Retrieving page "+str(index)+": ", url
    o = urllib2.urlopen(url)
    r = o.read()
    elements = regex.findall(r)
    author_list = []
    for x, y in elements:

        if x.startswith("http://") or x.startswith("https://"):
            continue
        xx = "".join(["http://www.youtube.com", x])
        href = xx.strip()
        #print href


        if "</span>" not in y :
            uname = y.strip()
        else:
            uname = y.split("</span>")[0].strip()

        if uname.startswith("<a"):
            continue

        if not uname or not href:
            continue

        if html:
            #1 output html
            author = "".join(["<a href=\"", href, "\">", uname, "</a>"])
        else:
            #2 output txt
            author = " ".join([uname, href])

        author_list.append(author)

    t = "%02d:%02d:%02d" % reduce(lambda ll,b : divmod(ll[0],b) + ll[1:], [(time.time()-t1,),60,60])
    print "".join(["Time passed: ", t])
    if not author_list:
        break
    else:
        author_lists.extend(author_list)
    index+=1
    #break #uncomment it if you only want to test one page

print "######################### Finished #########################"
print "Total comments: ", len(author_lists)
if author_lists:
    author_lists.sort()
    last = author_lists[-1]
    for i in range(len(author_lists)-2, -1, -1):
        if last == author_lists[i]:
            del author_lists[i]
        else:
            last = author_lists[i]
    if html:
        authors = "<br>".join(author_lists)
        authors = "".join(["<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><body>", authors, "</body></html>"])
        fname = vid+".html"
    else:
        authors = "\n".join(author_lists)
        fname = vid+".txt"

    #print "Authors: ", authors
    print "Total commenters: ", len(author_lists)



    oo = open(fname, "w")
    oo.write(authors)
    oo.close()
print "######################### Exit #########################"

Sample txt output: [image]

Sample html output: [image]
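The core of the script, extracting authors page by page until an empty page comes back, can be sketched in Python 3 as follows. The regular expression is copied from the script; everything else is an assumption made for illustration: `fetch_page` is a hypothetical callable that returns the markup for a given page number (for example, a wrapper around an HTTP request), and `max_pages` is a safety cap that the original does not have.

```python
import re

# Pattern taken from the script above: captures the profile href and the display name.
AUTHOR_RE = re.compile(
    r'<span class="author.*?<a href="(.*?)".*? dir="ltr">(.*?)</a>',
    re.DOTALL | re.IGNORECASE,
)


def parse_authors(page):
    """Return (name, absolute profile URL) pairs found in one page of markup."""
    authors = []
    for href, name in AUTHOR_RE.findall(page):
        name = name.split("</span>")[0].strip()
        # Same filters as the script: skip empty names, nested anchors, absolute links.
        if not name or name.startswith("<a") or href.startswith(("http://", "https://")):
            continue
        authors.append((name, "http://www.youtube.com" + href.strip()))
    return authors


def collect_authors(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no authors."""
    seen = {}
    for index in range(1, max_pages + 1):
        found = parse_authors(fetch_page(index))
        if not found:
            break
        for author in found:
            seen.setdefault(author, None)  # dict keeps first-seen order, drops repeats
    return list(seen)
```

Passing the fetcher in as a function keeps the parsing logic independent of the network, so it can be tried on saved or hand-written markup first.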

Answer 3 (score: 0)

C# can also help this way (although HAP and WebRequest are better):

    // Requires COM references to SHDocVw (Microsoft Internet Controls) and mshtml.
    object o = null; // placeholder for the optional Navigate arguments
    SHDocVw.InternetExplorer ie = new SHDocVw.InternetExplorerClass();
    WebBrowser wb = (WebBrowser)ie;
    wb.Visible = true;
    // Do anything else with the window here that you wish
    wb.Navigate("https://adwords.google.co.uk/um/Logout", ref o, ref o, ref o, ref o);
    // Wait until the page has finished loading
    while (wb.Busy) { Thread.Sleep(100); }
    HTMLDocument document = ((HTMLDocument)wb.Document);
    // Fill in the e-mail field
    IHTMLElement element = document.getElementById("Email");
    HTMLInputElementClass email = (HTMLInputElementClass)element;
    email.value = "testtestingtton@gmail.com";
    email = null;
    // Fill in the password field
    element = document.getElementById("Passwd");
    HTMLInputElementClass pass = (HTMLInputElementClass)element;
    pass.value = "pass";
    pass = null;
    // Click the sign-in button
    element = document.getElementById("signIn");
    HTMLInputElementClass subm = (HTMLInputElementClass)element;
    subm.click();
    subm = null;

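Answer 4 (score: 0)

Write rssfeeds for the name field and the other fields you want to extract, set up the scraper with an automation plugin, and follow the steps in "How to extract the data from multiple website".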

Answer 5 (score: 0)

Here is a simple solution using ruby and the gems nokogiri and open-uri:

require 'nokogiri'
require 'open-uri'

url = "https://www.youtube.com/all_comments?v=mIA0W69U2_Y"
# URI.open is the open-uri entry point; plain open(url) stopped handling URLs in Ruby 3.0
dom = Nokogiri::HTML(URI.open(url))
dom.xpath("//div[@class='comment-entry']").each do |comment|
  # The first link whose class contains 'user-name' carries the name and the profile href
  link = comment.at_xpath(".//a[contains(@class,'user-name')]")
  next unless link
  username = link.content.strip
  profilelink = link['href'].to_s.strip
  # Turn a site-relative href into an absolute URL
  profilelink = "http://www.youtube.com" + profilelink if profilelink.start_with?("/")
  puts "#{username} #{profilelink}" unless username.empty? || profilelink.empty?
end

For more details, see "How to extract data easily from multiple websites".