Question

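I need to identify all tags that contain the word "method".

I wrote Python code using requests and regular expressions. The code first reads a text file to extract the IDs, then uses requests to open the URL for each ID and find the tags that contain the keyword "method", but the output is an empty list. Here is the code: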

import requests
from bs4 import BeautifulSoup as bs
import re


def read_file():
    # Read one PMC ID per line, dropping surrounding whitespace.
    with open("C://Users//reshma.regi//PycharmProjects//Method_mining_from_jornals//test_.txt") as f:
        content = [x.strip() for x in f.readlines()]

    for pmcid in content:
        # Fetch the article XML from NCBI E-utilities.
        r = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=' + pmcid + '&tool=my_tool&email=my_email@example.com')
        soup = bs(r.content, 'lxml')
        # This is the call that comes back as an empty list.
        pmc = soup.findAll(re.compile(r'(methods)'))
        print(pmc)


def main():
    read_file()


if __name__ == '__main__':
    main()

To test the code, you can use the following PMCIDs: [2150890, 2364767]

The desired output for PMCID 2150890 is:

    <title>Materials and methods</title>
    <sec>
<title>Chromatin unfolding assay</title>
<p>
To construct the EGFP-lac-E2F1 and EGFP-lac-p53 fusion expression vectors, the PCR fragments that encode the E2F1 (aa 368–437) and p53 (aa 1–73), respectively, were cloned into the AscI site in the plasmid p3′SS d tb Cl EGFP AscI (NYE4) (A.C. Nye and A.S. Belmont, personal communication). The correct orientation of the inserts was identified by colony hybridization and confirmed by DNA sequencing. To construct the lac-BRCA1 plasmids, the sequence for lac repressor was first amplified by PCR from the plasmid NYE4. The lac sequence was cloned into the HindIII–NotI sites of pRC-CMV (Invitrogen), generating pRC-lac. Various BRCA1 fragments and the COBRA1 sequence were amplified by PCR and inserted into the unique AscI site of pRC-lac.
</p>
<p>
The chromatin unfolding experiments were performed as previously described (
<xref rid="bib43" ref-type="bibr">Tumbar et al., 1999</xref>
). Briefly, AO3_1 cells were transiently transfected with the lac expression vectors using the FuGENE 6 transfection reagent (Roche). The medium was changed 24 h after transfection and cells were immunostained 48 h after transfection. Cells grown on glass coverslips were fixed with 1.6% paraformaldehyde for 30 min in PBS, permeabilized with 0.2% Triton X-100 in PBS for 5 min, and blocked in 1% normal goat serum in PBS for 1 h. The coverslips were then incubated with primary antibodies at room temperature for 1 h, followed by incubation with the appropriate secondary antibodies for 1 h. Unless otherwise specified, a rabbit polyclonal anti–lac repressor antibody (Stratagene) and mouse monoclonal anti-FLAG antibody (Sigma-Aldrich) were applied at 1:20,000 dilution. The anti–acetylated histone H3 antibody was raised against di-acetylated H3 (Lys9 and Lys14) (
<xref rid="bib4" ref-type="bibr">Boggs et al., 1996</xref>
) (
<xref rid="bib20" ref-type="bibr">Lin et al., 1989</xref>
), a gift from Drs. C. Mizzen and C.D. Allis (University of Virginia, Charlottesville, VA). The secondary antibodies were goat anti–rabbit IgG-conjugated with Cy3 (Amersham), and horse anti–mouse IgG-conjugated with fluorescein isothiocyanate (FITC; Vector Laboratories).
</p>
<p>
For visualization of the nuclei, cells were stained with 0.2 μg/ml 4,6-diamidino-2-phenylindole (DAPI) for 5 min before mounting. Fluorescent images were acquired by a charged-coupled device camera (Hamamatsu ORCA) that was mounted on a Nikon Microphot-SA microscope and equipped with Improvision Openlab software. Confocal images were collected on a Zeiss LSM410 confocal microscope. Figs. were assembled using Adobe Photoshop (v. 5.5).
</p>
</sec>
<sec>
<title>Yeast two-hybrid screen</title>
<p>
To identify proteins that specifically interact with the BRCT1 repeat of BRCA1, the standard yeast two-hybrid screen was performed in the following manner. First, the bait plasmid was generated by inserting a PCR-amplified cDNA fragment encoding the BRCT1 sequence (aa 1642–1736) into the NdeI–EcoRI restriction sites of pAS2–1 (CLONTECH Laboratories, Inc.), resulting in an in-frame fusion with the GAL4 DNA-binding domain. The resultant plasmid, pAS2-BRCT1, and a human ovary cDNA prey library (CLONTECH Laboratories, Inc.) were sequentially transformed into the
<italic>S. cerevisiae</italic>
strain CG1945 according to the manufacturer's instructions (CLONTECH Laboratories, Inc.). Transformants were plated on synthetic medium lacking tryptophan, leucine and histidine but containing 1 mM 3-aminotriazole. Approximately 2.3 million transformants were screened. The candidate clones were retrieved from the yeast cells and reintroduced back to the same yeast strain to verify the interaction between the candidates and the BRCT1 bait. The specificity of the interaction was determined by comparing the interactions between the candidates and various bait constructs.
</p>
</sec>
<sec>
<title>Coimmunoprecipitation</title>
<p>
HEK293T cells were transfected using LipofectAmine 2000 (GIBCO BRL). 24 h after transfection, cells were washed twice with PBS and lysed in 0.5 ml lysis buffer (50 mM Hepes, pH 8, 250 mM NaCl, 0.1% NP-40, and protease inhibitor tablets from Roche). After brief sonication, the lysate was centrifuged at 16,000
<italic>g</italic>
for 12 min at 4°C. The supernatant was used for subsequent coimmunoprecipitation. 20 μl of the supernatant was used as crude extract for detecting protein expression level. 15 μl of a 50% slurry of the anti-FLAG agarose beads (Sigma-Aldrich) was used in each immunoprecipitation. Immunoprecipitation was performed overnight at 4°C. The beads were centrifuged at 3,300 rpm for 2 min, and washed three times with washing buffer (50 mM Hepes, pH8, 500 mM NaCl, 0.5% NP-40) and three times with RIPA buffer (50 mM Tris, pH 8.0, 150 mM NaCl, 1% NP-40, 0.1% SDS, and 0.5% sodium deoxycholate). Each wash was performed for at least 30 min. The precipitates were then eluted in 15 μl 2× SDS-PAGE sample buffer. Gel electrophoresis was followed by immunoblotting according to standard procedures.
</p>
</sec>
<sec>
<title>GST pulldown assay</title>
<p>
The PCR fragments encoding various BRCA1 fragments were cloned into pGEX-2T and the constructs were confirmed by sequencing. The GST-BRCA1 proteins were made and purified, with the induction of protein expression performed at 19°C overnight. pcDNA3 vector containing the COBRA1 gene was used for in vitro transcription and translation in the TnT Reticulocyte Lysate system (Promega). The
<sup>35</sup>
S-labeled COBRA1 was translated in vitro according to the manufacturer's instructions and mixed with 10 μg the GST-bound bead in 0.5 ml binding buffer (50 mM Tris-HCl, pH 7.5, 150 mM NaCl, 1 mM EDTA, 0.3 mM DTT, 0.1% NP-40 and protease inhibitor tablet). The binding reaction was performed at 4°C overnight and the beads were subsequently washed four times with washing buffer (same as binding buffer except 0.5% NP-40 was used), 30 min each time. The beads were eluted in 10 μl 2 × SDS-PAGE sample buffer and the proteins were resolved on 10% denaturing gel. The gel was then dried and exposed to x-ray films for overnight.
</p>
</sec>
</sec>

Answer 1

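For HTML

It's hard to know what the "right" thing is for this document, as it's not exactly HTML. Oh, I see, the second line explains that it is XML conforming to nlm-articleset-2.0.dtd. Some XML parser would likely be more appropriate than BS4, but we'll push ahead anyway.

Suppose we squish it into something closer to well-formed HTML: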

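# r.text is a str (r.content is bytes, so str.replace arguments would fail on it);
# the closing tags are renamed too so each div ends where its section ends.
soup = bs(r.text.replace('<sec', '<div').replace('</sec>', '</div>').replace(' sec-type=', ' class='), 'lxml')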
divs = soup.find_all('div')

Then, if we ask for all the divs, divs[8] has the desired content.

This gets just the one section:

divs = soup.find_all('div', class_='materials|methods')

so divs[0] has the content.

Within a section, you may find it helpful to query for <p> or <title> tags.
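As a rough sketch of that last step, the snippet below applies the same sec-to-div rewrite to a small hand-made sample and then pulls the <title> and <p> tags out of the methods section. The SAMPLE string is an abbreviated stand-in invented for illustration, not the real efetch response, and the built-in 'html.parser' is swapped in for 'lxml' only so the sketch has one fewer dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the article XML.
SAMPLE = """
<body>
<sec sec-type="materials|methods">
<title>Materials and methods</title>
<sec>
<title>Chromatin unfolding assay</title>
<p>To construct the vectors, PCR fragments were cloned.</p>
</sec>
<sec>
<title>Coimmunoprecipitation</title>
<p>HEK293T cells were transfected.</p>
</sec>
</sec>
</body>
"""

# Same idea as above: rename <sec> to <div> and sec-type to class.
html = (SAMPLE.replace('</sec', '</div')
              .replace('<sec', '<div')
              .replace(' sec-type=', ' class='))
soup = BeautifulSoup(html, 'html.parser')

# Pick the methods section, then query inside it only.
section = soup.find('div', class_='materials|methods')
titles = [t.get_text(strip=True) for t in section.find_all('title')]
paragraphs = [p.get_text(' ', strip=True) for p in section.find_all('p')]

print(titles)
print(paragraphs)
```

Because find_all is called on the section rather than on soup, titles and paragraphs from the rest of the article are left out.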

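For XML

ElementTree

BeautifulSoup is great for scraping browser web pages. But that's not the structure of this document. Let's use a different technique, one that parses according to that structure.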

import xml.etree.ElementTree as et

root = et.fromstring(r.content)
for i, sec in enumerate(root.iter('sec')):
    if sec.attrib:
        print(i, sec.attrib)

8 {'sec-type': 'materials|methods'}

You can go on to parse out the pieces from there.
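One possible way to continue, sketched with the standard library only: find the sec element whose sec-type is materials|methods, then collect each subsection's title and paragraph text. The SAMPLE string and the helper name methods_subsections are made up for illustration; with the real response you would pass et.fromstring(r.content) instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the efetch XML.
SAMPLE = """<article><body>
<sec sec-type="materials|methods">
<title>Materials and methods</title>
<sec>
<title>Chromatin unfolding assay</title>
<p>The experiments were performed as previously described (<xref rid="bib43" ref-type="bibr">Tumbar et al., 1999</xref>).</p>
</sec>
<sec>
<title>Coimmunoprecipitation</title>
<p>The lysate was centrifuged at 16,000 <italic>g</italic> for 12 min.</p>
</sec>
</sec>
</body></article>"""


def methods_subsections(root):
    """Map each subsection title in the methods section to its paragraph texts."""
    result = {}
    for sec in root.iter('sec'):
        if sec.get('sec-type') != 'materials|methods':
            continue
        for sub in sec.findall('sec'):  # direct child sections only
            title = sub.findtext('title', default='').strip()
            # itertext() also picks up text inside inline tags such as <xref>.
            paragraphs = [' '.join(''.join(p.itertext()).split())
                          for p in sub.findall('p')]
            result[title] = paragraphs
    return result


root = ET.fromstring(SAMPLE)
subsections = methods_subsections(root)
for title, paragraphs in subsections.items():
    print(title, paragraphs)
```

Using itertext() rather than p.text matters here, since p.text stops at the first inline child element and would drop the citation and everything after it.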

xmltodict

You may find that the simple API offered by xmltodict ($ pip install xmltodict) suits this project well.
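A minimal sketch of what that could look like, assuming a simplified document shape: the SAMPLE string and the article/body nesting are invented for illustration and the real response nests more deeply. xmltodict exposes attributes under keys prefixed with '@', and force_list keeps sec a list even when only one is present.

```python
import xmltodict

# Hypothetical, shortened stand-in for the efetch XML.
SAMPLE = """
<article>
  <body>
    <sec sec-type="intro"><title>Introduction</title><p>Background text.</p></sec>
    <sec sec-type="materials|methods">
      <title>Materials and methods</title>
      <sec><title>Yeast two-hybrid screen</title><p>The bait plasmid was generated.</p></sec>
      <sec><title>GST pulldown assay</title><p>Fragments were cloned into pGEX-2T.</p></sec>
    </sec>
  </body>
</article>
"""

# force_list makes every <sec> value a list, so the code below
# does not need to special-case a single section.
doc = xmltodict.parse(SAMPLE, force_list=('sec',))
top_level = doc['article']['body']['sec']

# Attributes are stored under '@'-prefixed keys.
methods = next(s for s in top_level if s.get('@sec-type') == 'materials|methods')
subsection_titles = [s['title'] for s in methods.get('sec', [])]

print(methods['title'])
print(subsection_titles)
```

Paragraphs that mix text with inline tags such as <xref> come out as nested dicts rather than plain strings, so this approach is most comfortable for titles and attributes.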

Answer 2

I believe the output of the following code is similar to the output you gave for PMCID 2150890:

    pmc = soup.find_all('title', string=re.compile(r'method'))
    for i in pmc:
        print(i.parent)
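To see why this differs from the code in the question, here is a small side-by-side sketch. A regular expression passed as the first argument to find_all is matched against tag names, while string= is matched against a tag's text. The SAMPLE string is invented for illustration, 'html.parser' is used instead of 'lxml' to keep the sketch dependency-light, and re.IGNORECASE is my own addition in case a heading is capitalized as "Methods".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article XML.
SAMPLE = """
<sec sec-type="materials|methods">
<title>Materials and methods</title>
<p>HEK293T cells were transfected.</p>
</sec>
<sec>
<title>Results</title>
<p>COBRA1 binds BRCA1.</p>
</sec>
"""
soup = BeautifulSoup(SAMPLE, 'html.parser')

# Question's approach: the regex is compared with tag NAMES
# (sec, title, p), none of which contain "methods".
by_name = soup.find_all(re.compile(r'methods'))

# Answer's approach: restrict to <title> tags and match their TEXT.
by_text = soup.find_all('title', string=re.compile(r'method', re.IGNORECASE))

# .parent steps up from the matching <title> to its enclosing section.
sections = [t.parent for t in by_text]

print(by_name)
print([s.name for s in sections])
```

Here by_name is empty while by_text holds the single "Materials and methods" title, whose parent is the section to print.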

Identify the word "method" in tags and extract the text
