Reading a complex HTML file into R with rvest

Time: 2018-09-12 14:46:32

Tags: html r rvest

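I am new to both R and Stack Overflow, so please be gentle; I will do my best to make this post as correct as possible. I am working on a project that compares whole exome sequencing (WES) results with proteomics data. Our WES facility only releases its data as HTML files, so I need to read those into R before I can continue.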

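I tried following the DataCamp tutorial for rvest, but I think the HTML file may be too complex, because all I get is a run of \t\t\t\n\n\t with some text scattered in between. I suspect the selector I pass to html_nodes is wrong?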

Here is my R code, followed by a shortened version of the HTML.

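What I would like to get is a data frame with the same columns as the table in the HTML. As the example shows, some variants affect multiple transcripts; in those cases one row per transcript would be perfect, but it is by no means required.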

Thank you very much for your help!

Sebastian

library(tidyverse)  
library(rvest)    

htmlALL <- read_html("Example_html")

getDATA <- function(html) {
  html %>%
    html_nodes(".table") %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}

df_html <- getDATA(htmlALL)

<!DOCTYPE html
	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
  <!-- add title in the browser tab bar -->
  <title>Homozygous variants of sample XXX </title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>


<!-- change style to look nice -->
<style type="text/css">


html { 
  text-align: center;
  vertical-align: middle;
  height: 100%;
  width: 100%;
}
body { 
  background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
  font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
  font-size: 62.5%;
  line-height: 1;
  color: #585858;
  padding: 22px 10px;
  padding-bottom: 55px;

}

::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }

br { display: block; line-height: 1.6em; } 

input, textarea { 
  -webkit-font-smoothing: antialiased;
  -webkit-text-size-adjust: 100%;
  -ms-text-size-adjust: 100%;
  -webkit-box-sizing: border-box;
  -moz-box-sizing: border-box;
  box-sizing: border-box;
  outline: none; 
}

blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; } 


h1 {
  font-weight: bold;
  font-size: 3.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

h2 {
  font-weight: bold;
  font-size: 2.6em;
  line-height: 1.7em;
  margin-bottom: 10px;
  text-align: center;
}

/** big white sheet everything is on **/
.wrapper {
  display: block;
  width: 95%;
  background: #fff;
  margin: 0 auto;
  padding: 10px 17px 100px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  overflow-x: auto;
  overflow-y: visible;
}

/* smaller box the family information is on */
.info{
  display: block;
  width: 800px;
  background: #f2f2f2;
  margin: 0 auto;
  padding: 10px 17px 10px 10px;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  font-size: 1.8em;
  margin-bottom: 10px;
}


/* this is what actually contains the info */
.table {
  display: table;
  margin: 0 auto;
  width: 99%;
  font-size: 1.2em;
  margin-bottom: 15px;
  border-collapse: collapse;
  overflow: visible;
}

/* one row of the variants */
.tablerow {
  display: table-row;
  overflow: visible;
  border: 1px solid gray;
  width: 100%;
}

/* headers are bigger and may in the future be clickable to sort accordingly */
.tableheader {
  display: table-cell;
  background: #f2f2f2;
  padding: 3px 10px;
  margin-bottom: 25px;
  font-size: 1.8em;
  box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
  -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}

/* in the following each column gets specified to increase readability */

.position {
  display: table-cell;
  padding: 3px 10px;
  font-size: 1.4em;
  height: 100%;
  text-align: center;
  vertical-align: middle;
}

.variants {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  overflow: visible;
  white-space: nowrap;
  
}

.stacked {
  display: table;
  height: 50%;
  width: 100%;

}

.center {
  display: table-cell;
  vertical-align: middle;
  width: 100%;
  padding: 0px 5px;
}


.consequences {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 3px 10px;
}

.gene {
  display: table-cell;
  padding: 3px 15px;
  height: 100%;
  vertical-align: middle;
  font-size: 1.4em;
  font-weight: bold;
}

.transcripts {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.list {
  height: 100%;
  width: 100%;
  display: table;
  table-layout: fixed;
}
.row {
  display: table-row;
  overflow: visible;
  vertical-align: middle;
}
.entry {
  display: table-cell;
  vertical-align:middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

.cdspos {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}

.exon {
  display: table-cell;
  vertical-align: middle;
  height: 100%;
}



.hgvs {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}

.hgvs .list .row{
  display: table-row;
  vertical-align: middle;
}

.polyphen {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.polyphen .list .row{
  display: table-row;
  vertical-align: middle;
}

.sift {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}
.sift .list .row{
  display: table-row;
  vertical-align: middle;
}

.allelefreq {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
}



/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
    position: relative;
    display: inline-block;
    border-bottom: 1px dotted black; /* If you want dots under the hoverable text */
    
}



.tooltiptext{
    visibility: hidden;
    overflow: auto;
    min-width: 400px;
    background-color: #ffb380;
    color: black;
    text-align: left;
    padding: 5px 10px;
    border-radius: 6px;
    font-size: 12pt;
    font-weight: normal;
    
    /* Position the tooltip text - see examples below! */
    position: absolute;
    z-index:1;
    
    /* shadow */
    box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    -o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
    
    opacity: 0.95;
    filter: alpha(opacity=95);

}

/* Tooltip text */
.tooltip_gene .tooltiptext {
    top: -5px;
    left: 105%;
 
}


/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
    top: -5px;
    right: 105%;
    min-width: 120px;
    
 
}

/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
    visibility: visible;
}


.clin {
  display: table-cell;
  height: 100%;
  vertical-align: middle;
  padding: 0% 1% 0% 1%;
  white-space: nowrap;
  text-overflow: ellipsis;
  overflow: hidden;
}

</style>


<body>
  <div class="wrapper">
      <!-- add info about patients -->
      <h1>Homozygous variants of sample XXX</h1>
      <h2>Tue Jan 23 09:01:56 2018</h2>
      <div class="info">
	
	  Patient only<br>
	
      </div>
      <!-- variants table start -->
      <div class="table">
	<!-- table header start -->
	<div class="tablerow">
	  <div class="tableheader">
	    Position
	  </div>
	  <div class="tableheader">
	    Variant
	  </div>
	  <div class="tableheader">
	    Cons
	  </div>
	  <div class="tableheader">
	    Gene
	  </div>
	  <div class="tableheader">
	    Transcript
	  </div>
	  <div class="tableheader">
	    HGVSC
	  </div>
	  <div class="tableheader">
	    HGVSP
	  </div>
	  <div class="tableheader">
	    PolyPhen
	  </div>
	  <div class="tableheader">
	    SIFT
	  </div>
	  <div class="tableheader">
	    AF
	  </div>
	  <div class="tableheader">
	    Clin
	  </div>
	</div>
	<!-- table header stop -->
	<!-- var loop start -->
	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  G->T
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
		      TTF2
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.2940G>T(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00000
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	
	 	
	  <div class="tablerow" >
	    <!-- position start -->
	    <div class="position">
	      <a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
	    </div>
	    <!-- position stop -->
	    <!-- variants start -->
	    <div class="variants">
	      
		
		  
		      <a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>
		  
		
	      
	    </div>
	    <!-- variants stop -->
	    <!-- consequences start -->
	    <div class="consequences" style="background: rgb(196, 197, 198);">
	      
		synonymous
	      
	    </div>
	    <!-- consequences stop -->
	    <!-- gene start -->
	    <div class="gene" >
	      
	      
	      
		
		  <div class="tooltip_gene">
		    <a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
		      SF3B4
		    </a>
		    <span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
		  </div>
		
	    </div>
	    <!-- gene stop -->
	    <!-- transcripts start -->
	    <div class="transcripts">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
		      </a>
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      <a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
		      </a>
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- transcripts stop -->
	    <!-- exon start -->
	<!--    <div class="exon">
	      <div class="list">
		
	      </div>
	    </div>-->
	    <!-- exon stop -->
	    <!-- hgvsc start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsc stop -->
	    <!-- hgvsp start -->
	    <div class="hgvs">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			c.390C>A(p.%3D)
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			c.519C>A(p.%3D)
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- hgvsp stop -->
	    <!-- polyphen start -->
	    <div class="polyphen">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- polyphen stop -->
	    <!-- sift start -->
	    <div class="sift">
	      <div class="list">
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
		  <div class="row">
		    <div class="entry">
		      
			
		      
		    </div>
		  </div>
		
	      </div>
	    </div>
	    <!-- sift stop -->
	    <!--.allelefreq start -->
	    <div class="allelefreq">
	      
		
		  <div class="tooltip_allelefrq">
		    0.00021
		    <span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
		  </div>
		
	      
	    </div>
	    <!--.allelefreq stop -->
	    <!--.allelefreq start -->
	    <div class="clin">
	      
		
	      
	    </div>
	    <!--.allelefreq stop -->
	  </div>
	  <!-- table row stop-->
	 	
	<!-- var loop stop -->
      </div>
      <!-- variant table stop -->
    </div>
</body>
</html>

1 Answer:

Answer 0 (score: 3)

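This is the best I can offer. Note that the output also contains the "tooltip text" that pops up when you hover over the entries in the Gene column (and, as the AF column shows, the allele-count tooltip as well).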

library(rvest)

# I saved your sample to my Desktop as test.html
pg = read_html('~/Desktop/test.html')

# count rows (including header):
n_rows = pg %>% html_nodes('div.tablerow') %>% length

# sprintf-friendly format to get the %d-th node matching
#   //div[@class="tablerow"] (same as div.tablerow in CSS)
#   All of the /div after this are columns
xp_fmt = '//div[@class="tablerow"][%d]/div'

# div.tableheader nodes contain column names
col_names = pg %>% html_nodes(xpath = sprintf(xp_fmt, 1L)) %>% 
  html_text %>% trimws

# rows 2:n contain the actual data; gsub is
#   stripping leading/trailing whitespace and 
#   any duplicate internal whitespace
rows = lapply(2:n_rows, function(ii) {
  pg %>% html_nodes(xpath = sprintf(xp_fmt, ii)) %>% 
    html_text %>% gsub('^\\s+|\\s{2,}|\\s+$', '', .)
})

# can't forget those pesky factors
DF = as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(DF) = col_names
DF
#      Position Variant       Cons
# 1 1:117635487    G->T synonymous
# 2 1:149898455    G->A synonymous
#                                                                                                                                                                                                                                                                                                                                                                                                                                                     Gene
# 1 TTF2GeneCards Summary\nTTF2 (Transcription Termination Factor 2) is a Protein Coding gene.\nDiseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.\nAmong its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.\nGO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.\nAn important paralog of this gene is HLTF.
# 2                                                       SF3B4GeneCards Summary\nSF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.\nDiseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.\nAmong its related pathways are mRNA Splicing - Major Pathway and Gene Expression.\nGO annotations related to this gene include nucleic acid binding and nucleotide binding.
#                       Transcript            HGVSC
# 1                ENST00000369466        c.2940G>T
# 2 ENST00000457312ENST00000271628 c.390C>Ac.519C>A
#                            HGVSP PolyPhen SIFT
# 1               c.2940G>T(p.%3D)              
# 2 c.390C>A(p.%3D)c.519C>A(p.%3D)              
#                                                         AF
# 1       0.00000allele countsht: 0hm: 0wt: 0inhouse:0.00118
# 2 0.00021allele countsht: 57hm: 0wt: 277082inhouse:0.00236
#   Clin
# 1     
# 2     
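
The output above has two rough edges: the tooltip text is fused into the Gene and AF cells, and variants with several transcripts have their values run together (e.g. ENST00000457312ENST00000271628). The following is only a sketch of one way to handle both, not part of the approach above. It assumes that the tooltip content stays nested inside span.tooltiptext after parsing, and that within one row every multi-valued cell holds the same number of div.entry nodes. The inline HTML is a cut-down stand-in for the report, and row_to_df and DF_long are names made up for this illustration.

```r
library(rvest)
library(xml2)

# Cut-down stand-in for the report; only the class names follow the question's HTML
html <- '
<div class="table">
  <div class="tablerow">
    <div class="tableheader">Gene</div>
    <div class="tableheader">Transcript</div>
    <div class="tableheader">HGVSC</div>
  </div>
  <div class="tablerow">
    <div class="gene"><div class="tooltip_gene"><a href="#"> TTF2 </a>
      <span class="tooltiptext">GeneCards Summary: TTF2 is a gene.</span></div></div>
    <div class="transcripts"><div class="list">
      <div class="row"><div class="entry"><a href="#">ENST00000369466 </a></div></div>
    </div></div>
    <div class="hgvs"><div class="list">
      <div class="row"><div class="entry"> c.2940G>T </div></div>
    </div></div>
  </div>
  <div class="tablerow">
    <div class="gene"><div class="tooltip_gene"><a href="#"> SF3B4 </a>
      <span class="tooltiptext">GeneCards Summary: SF3B4 is a gene.</span></div></div>
    <div class="transcripts"><div class="list">
      <div class="row"><div class="entry"><a href="#">ENST00000457312 </a></div></div>
      <div class="row"><div class="entry"><a href="#">ENST00000271628 </a></div></div>
    </div></div>
    <div class="hgvs"><div class="list">
      <div class="row"><div class="entry"> c.390C>A </div></div>
      <div class="row"><div class="entry"> c.519C>A </div></div>
    </div></div>
  </div>
</div>'
pg <- read_html(html)

# Drop the tooltip spans so their text no longer leaks into the cells
# (xml_remove modifies the parsed document in place)
xml_remove(html_nodes(pg, "span.tooltiptext"))

rows <- html_nodes(pg, "div.tablerow")
col_names <- trimws(html_text(html_nodes(rows[[1]], "div.tableheader")))

# Hypothetical helper: turn one table row into a data frame
# with one line per transcript
row_to_df <- function(row) {
  cells <- html_nodes(row, xpath = "./div")
  values <- lapply(cells, function(cell) {
    entries <- html_nodes(cell, "div.entry")
    # cells holding a nested list give one value per entry,
    # plain cells give a single value
    if (length(entries) > 0) {
      trimws(html_text(entries))
    } else {
      trimws(html_text(cell))
    }
  })
  names(values) <- col_names
  # single values are recycled to the number of transcripts
  as.data.frame(values, stringsAsFactors = FALSE)
}

# The first row is the header, so only the remaining rows carry data
DF_long <- do.call(rbind, lapply(rows[-1], row_to_df))
DF_long
```

If the entry counts differ between cells of the same row, as.data.frame will stop with an error instead of silently misaligning values, which seems the safer failure mode for variant data.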

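Note that it does not matter here, since all of your columns appear to be of type character, but a more sophisticated approach would write the rows out to a regular file (e.g. a csv) and then read that text back in with read.table (or better, fread) to detect the column types automatically.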
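
As a rough illustration of that last point, here is a minimal sketch of two ways to get automatic type detection with base R only. The small DF below is a hypothetical stand-in for the all-character data frame built earlier; DF_typed and DF_csv are names chosen for this example. The type.convert route assumes R 3.5 or later, where it accepts a data frame directly.

```r
# Hypothetical stand-in for the all-character data frame built above
DF <- data.frame(
  Position = c("1:117635487", "1:149898455"),
  AF       = c("0.00000", "0.00021"),
  PolyPhen = c("", ""),
  stringsAsFactors = FALSE
)

# Option 1: let base R guess each column's type in place;
# empty strings are treated as missing values
DF_typed <- type.convert(DF, as.is = TRUE, na.strings = c("", "NA"))

# Option 2: round-trip through a temporary csv file and let the
# reader detect the types while parsing
tmp <- tempfile(fileext = ".csv")
write.csv(DF, tmp, row.names = FALSE)
DF_csv <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

str(DF_typed)
str(DF_csv)
```

In both cases AF should come back as numeric while Position stays character. data.table::fread could replace read.csv in the second option if that package is available.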