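Is there any way to preserve line breaks in a plain-text page with JSoup? I am trying to pull robots.txt, but instead of coming back line by line, the whole body tag is pulled in as a single line.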
var response = Jsoup.connect("http://www.facebook.com/robots.txt").userAgent(userAgent).followRedirects(true).execute()
println(response.parse().body().text())
I get the text response on a single line, like this:
# Notice: Crawling Facebook is prohibited unless you have express written # permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php User-agent: Applebot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: baiduspider Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Bingbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Googlebot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: ia_archiver Disallow: / Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: msnbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php 
Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Naverbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: seznambot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Slurp Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: teoma Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Twitterbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Yandex Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Yeti Disallow: /ajax/ Disallow: 
/album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Applebot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: baiduspider Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Bingbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Googlebot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: ia_archiver Allow: /about/privacy Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /full_data_use_policy Allow: /legal/terms Allow: /policy.php Allow: /safetycheck/ User-agent: msnbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Naverbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: seznambot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Slurp Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: teoma Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Twitterbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Yandex Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Yeti Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: * Disallow: / ]
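I would like to parse the file line by line (the way it appears when viewed in a browser) and run regular expressions against it. Any help would be appreciated.
Thanks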
Answer 0 (score: 1)
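How about doing this differently and simply downloading the file? robots.txt is obviously a text file, so we can fetch it as-is instead of trying to scrape it as HTML.
This still uses Jsoup, just slightly differently than before.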
Connection.Response robotsText = Jsoup.connect( "https://www.facebook.com/robots.txt" ).execute();
// try-with-resources closes the stream even if write() throws
try ( FileOutputStream fileOutputStream = new FileOutputStream( new File( "robots.txt" ) ) ) {
    fileOutputStream.write( robotsText.bodyAsBytes() );
}
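Once robots.txt is on disk, the line-by-line regex processing the question asks for could look roughly like the sketch below. It uses only the standard library; the class name RobotsFileReader, the method disallowedPaths, and the pattern are illustrative assumptions, not part of Jsoup or of the original answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): reads a saved robots.txt and
// collects the path of every Disallow rule.
class RobotsFileReader {

    // Matches lines such as "Disallow: /ajax/" and captures the path.
    private static final Pattern DISALLOW =
            Pattern.compile("^\\s*Disallow:\\s*(\\S*)", Pattern.CASE_INSENSITIVE);

    static List<String> disallowedPaths(Path file) throws IOException {
        List<String> paths = new ArrayList<>();
        // readAllLines splits on \n, \r and \r\n, so each rule is its own string.
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            Matcher m = DISALLOW.matcher(line);
            if (m.find()) {
                paths.add(m.group(1));
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Assumes robots.txt was already written to the working directory.
        for (String path : disallowedPaths(Paths.get("robots.txt"))) {
            System.out.println(path);
        }
    }
}
```

Because the bytes were written unmodified, the file keeps the server's original line breaks, so any line-oriented reader works here.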
Answer 1 (score: 1)
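The unflattened text is available through a TextNode in Jsoup. Element.text() normalizes whitespace, which is what collapses the line breaks, whereas TextNode.getWholeText() returns the node's text unmodified. E.g.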
Document doc = Jsoup.connect("http://www.facebook.com/robots.txt").get();
String robotsText = doc.body().textNodes().get(0).getWholeText();
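With the whole text in hand as a single String, it can be split back into lines and processed. The following is a minimal sketch that groups rules by user agent; the class RobotsRules and the method byUserAgent are hypothetical names chosen for illustration, and the parsing is deliberately simplified (it does not, for instance, merge several consecutive User-agent lines into one group).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Jsoup): groups robots.txt rules by user agent.
class RobotsRules {

    static Map<String, List<String>> byUserAgent(String wholeText) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        List<String> current = null;
        // \R matches any line break (\n, \r\n, \r); available since Java 8.
        for (String raw : wholeText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                // Start (or resume) the rule list for this agent.
                current = rules.computeIfAbsent(value, k -> new ArrayList<>());
            } else if (current != null) {
                current.add(field + ": " + value);
            }
        }
        return rules;
    }
}
```

Passing the result of getWholeText() to byUserAgent would yield one entry per agent, in the order the agents first appear in the file.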