I am trying to extract data from a multi-level structured XML file. The input file will be the output of a PubMed search query:
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="Publisher" Owner="NLM">
<PMID Version="1">24874852</PMID>
<DateCreated>
<Year>2014</Year>
<Month>5</Month>
<Day>30</Day>
</DateCreated>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1976-670X</ISSN>
<JournalIssue CitedMedium="Internet">
<PubDate>
<Year>2014</Year>
<Month>May</Month>
<Day>30</Day>
</PubDate>
</JournalIssue>
<Title>BMB reports</Title>
<ISOAbbreviation>BMB Rep</ISOAbbreviation>
</Journal>
<ArticleTitle>
Human selenium binding protein-1 (hSP56) is a negative regulator of HIF-1α and suppresses the malignant characteristics of prostate cancer cells.
</ArticleTitle>
<Pagination>
<MedlinePgn/>
</Pagination>
<ELocationID EIdType="pii">2831</ELocationID>
<Abstract>
<AbstractText NlmCategory="UNLABELLED">
In the present study, we demonstrate that ectopic expression of 56-kDa human selenium binding protein-1 (hSP56) in PC-3 cells that do not normally express hSP56 results in a marked inhibition of cell growth in vitro and in vivo. Down-regulation of hSP56 in LNCaP cells that normally express hSP56 results in enhanced anchorage-independent growth. PC-3 cells expressing hSP56 exhibit a significant reduction of hypoxia inducible protein (HIF)-1α protein levels under hypoxic conditions without altering HIF-1α mRNA (HIF1A) levels. Taken together, our findings strongly suggest that hSP56 plays a critical role in prostate cells by mechanisms including negative regulation of HIF-1α, thus identifying hSP56 as a candidate anti-oncogene product.
</AbstractText>
</Abstract>
<AuthorList>
<Author>
<LastName>Jeong</LastName>
<ForeName>Jee-Yeong</ForeName>
<Initials>JY</Initials>
<Affiliation>
Laboratory for Cell and Molecular Biology, Division of Hematology and Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA; Department of Biochemistry and Cancer Research Institute, Kosin University College of Medicine, Busan, South Korea.
</Affiliation>
</Author>
<Author>
<LastName>Zhou</LastName>
<ForeName>Jin-Rong</ForeName>
<Initials>JR</Initials>
</Author>
<Author>
<LastName>Gao</LastName>
<ForeName>Chong</ForeName>
<Initials>C</Initials>
</Author>
<Author>
<LastName>Feldman</LastName>
<ForeName>Laurie</ForeName>
<Initials>L</Initials>
</Author>
<Author>
<LastName>Sytkowski</LastName>
<ForeName>Arthur J</ForeName>
<Initials>AJ</Initials>
</Author>
</AuthorList>
<Language>ENG</Language>
<PublicationTypeList>
<PublicationType>JOURNAL ARTICLE</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2014</Year>
<Month>5</Month>
<Day>30</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<MedlineTA>BMB Rep</MedlineTA>
<NlmUniqueID>101465334</NlmUniqueID>
<ISSNLinking>1976-6696</ISSNLinking>
</MedlineJournalInfo>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2014</Year>
<Month>5</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2014</Year>
<Month>5</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2014</Year>
<Month>5</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>aheadofprint</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pii">2831</ArticleId>
<ArticleId IdType="pubmed">24874852</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
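My goal is to reorganize this data on another web page, so I am trying to extract data from every level of this structure. I am using regular expressions. For example, to extract the abstract text from the XML structure, this is the code I am using: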
$o=urlencode("24874852");
$efetch = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=$o&retmode=xml&rettype=abstract&email=abc@xyz.com";
#echo $efetch;
$handle1 = file_get_contents($efetch,"r");
#echo $handle1s;
preg_match_all('/<AbstractText>\s*([0-9A-Za-z\.\_\n]+)\s*<\/AbstractText>/s',$handle1,$abstext,PREG_PATTERN_ORDER);
foreach ($abstext[1] as $tiab){
echo $tiab; }
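I am not getting the output I expect. Any idea what might be going wrong?
Answer 0 (score: 1)
If you want to extract text from XML, your best option is to use an XML parser, such as the DOM parser: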
$document = new DOMDocument();
$document->load( "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24874852&retmode=xml&rettype=abstract&email=abc@xyz.com" );
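A live request can fail or return something that is not well-formed XML, in which case load() only emits warnings. The following is a minimal sketch, assuming you fetch the response body yourself and pass it in as a string; the helper name loadPubmedXml is hypothetical and not part of any library:

```php
<?php
// Hypothetical helper: parse an XML string into a DOMDocument and
// throw an exception instead of emitting warnings on malformed input.
function loadPubmedXml(string $xml): DOMDocument
{
    // Collect libxml errors internally rather than printing warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $ok = $document->loadXML($xml);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        $message = $errors ? trim($errors[0]->message) : 'unknown parse error';
        throw new RuntimeException("Could not parse XML: $message");
    }
    return $document;
}

// Usage with an inline sample instead of a live efetch request.
$document = loadPubmedXml('<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>');
echo $document->documentElement->nodeName, "\n";
```

Working from a string also makes it easy to cache the response or to test the extraction code without network access.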
You can use the XPath language to select the data you want to extract: //AbstractText will return the set of all <AbstractText> nodes.
You can use XPath in PHP on the parsed document:
$xpath = new DOMXPath($document);
To get all of the nodes, use:
$xpath->evaluate("//AbstractText")
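Unlike query(), evaluate() can also return a scalar when the expression itself yields a string or a number. The sketch below illustrates this on an inline fragment; the fragment is a trimmed copy of the data in the question, used here only for illustration:

```php
<?php
// Trimmed sample based on the XML in the question.
$xml = <<<'XML'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">24874852</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>Jeong</LastName></Author>
          <Author><LastName>Zhou</LastName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// A node-set expression returns a DOMNodeList.
$authors = $xpath->evaluate('//Author/LastName');

// string() and count() make evaluate() return a scalar directly.
$pmid = $xpath->evaluate('string(//MedlineCitation/PMID)');
$authorCount = (int) $xpath->evaluate('count(//Author)'); // count() yields a float

echo $pmid, "\n";
echo $authorCount, "\n";
echo $authors->item(0)->nodeValue, "\n";
```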
To extract the text from each node, use nodeValue:
foreach ($xpath->evaluate("//AbstractText") as $abstractText) {
echo $abstractText->nodeValue."\n";
}
See a working example using your data here: http://codepad.viper-7.com/nlryKH
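The same approach extends to the other levels of the structure, which is what the question is ultimately after. The following is a minimal sketch, assuming one result array per <PubmedArticle>; the function name extractArticles and the array keys are hypothetical, and the inline sample is a trimmed copy of the question's data:

```php
<?php
// Hypothetical helper: collect a few fields from every <PubmedArticle>.
function extractArticles(DOMDocument $document): array
{
    $xpath = new DOMXPath($document);
    $articles = [];

    foreach ($xpath->evaluate('//PubmedArticle') as $article) {
        $authors = [];
        // The second argument makes the expression relative to $article.
        foreach ($xpath->evaluate('.//AuthorList/Author', $article) as $author) {
            $authors[] = [
                'last' => $xpath->evaluate('string(LastName)', $author),
                'fore' => $xpath->evaluate('string(ForeName)', $author),
            ];
        }

        $articles[] = [
            'pmid'     => $xpath->evaluate('string(MedlineCitation/PMID)', $article),
            // normalize-space() trims the line breaks around the text.
            'title'    => $xpath->evaluate('normalize-space(.//ArticleTitle)', $article),
            'abstract' => $xpath->evaluate('normalize-space(.//Abstract/AbstractText)', $article),
            'authors'  => $authors,
        ];
    }
    return $articles;
}

// Trimmed sample based on the XML in the question.
$xml = <<<'XML'
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">24874852</PMID>
      <Article PubModel="Print-Electronic">
        <ArticleTitle>
          Sample title.
        </ArticleTitle>
        <Abstract>
          <AbstractText NlmCategory="UNLABELLED">
            Sample abstract text.
          </AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Jeong</LastName><ForeName>Jee-Yeong</ForeName></Author>
          <Author><LastName>Zhou</LastName><ForeName>Jin-Rong</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$articles = extractArticles($document);

foreach ($articles as $article) {
    echo $article['pmid'], ': ', $article['title'], "\n";
    foreach ($article['authors'] as $author) {
        echo '  ', $author['fore'], ' ', $author['last'], "\n";
    }
}
```

Note that string() and normalize-space() only use the first node of a node set, so a record with several <AbstractText> sections would need a loop over //AbstractText, as in the answer's foreach, rather than the single expression used here.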