Question

<p><a href=\"https://news.yahoo.com/during-siege-orlando-gunman-told-police-islamic-soldier-034552865.html\"><img src=\"https://s1.yimg.com/bt/api/res/1.2/1aLfwfzLVx7.osxsV87uog--/YXBwaWQ9eW5ld3NfbGVnbztmaT1maWxsO2g9ODY7cT03NTt3PTEzMA--/http://media.zenfs.com/en_us/News/Reuters/2016-06-20T125326Z_1_LYNXNPEC5J0TN_RTROPTP_2_FLORIDA-SHOOTING.JPG\" width=\"130\" height=\"86\" alt=\"***A woman mourns as she sits on the ground and takes part in a vigil for the Pulse night club victims following last week&#039;s shooting in Orlando\" align=\"left\" title=\"**A woman mourns as she sits on the ground and takes part in a vigil for the Pulse night club victims following last week&#039;s shooting in Orlando***\"** border=\"0\" /></a>The Florida nightclub killer called himself an &quot;Islamic soldier&quot; and threatened to strap hostages into explosive vests in calls with police during the three-hour siege, according to transcripts released by the FBI on Monday. In a first call he made to a 911 emergency operator, Mateen said &quot;I pledge allegiance to Abu Bakr al-Baghdadi, may God protect him, on behalf of the Islamic State,&quot; referring to the head of Islamic State. The FBI and U.S. State Department released partial transcripts of the four calls with the emergency operator and crisis negotiators earlier on Monday, omitting the shooter&#039;s references to the leader of Islamic State, saying they did not want to provide a platform for propaganda.</p><br clear=\"all\"/>

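For the HTML markup above, I am using a regular expression to strip the HTML tags and keep only the news description. The quoted part, "A woman mourns as she sits on the ground and takes part in a vigil for the Pulse night club victims following last week's shooting in Orlando", is missing from that description. How can I get that data?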

This is the regular expression I use to get the description:

String news_description = item_obj.getString("description");

String news_description_noHTMLString = news_description.replaceAll("\\<.*?>","");

Can anyone suggest how to do this?

Answer 1

Not a perfect solution, but it works in most cases.

    Pattern p = Pattern.compile("(alt|title).*?\"(.*?)\"");
    Matcher m = p.matcher(news);
    while (m.find()) {
        System.out.printf("%s: %s\n",m.group(1), m.group(2));
    }

To make it more robust, you should apply the pattern only inside tags rather than to the whole text.
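One way to restrict the pattern to tags is to match in two stages: first find each tag, then look for `alt` or `title` only within that tag's text. The following is a minimal sketch of that idea, not a full HTML parser. The class name `TagAttributeExtractor`, the method `extract`, and the sample string are hypothetical, and the sketch assumes double-quoted attribute values that do not contain `>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAttributeExtractor {
    // Stage 1: one whole tag, from "<" to the next ">".
    // Assumption: no ">" appears inside an attribute value.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Stage 2: alt="..." or title="..." preceded by whitespace, double-quoted values only.
    private static final Pattern ATTR =
            Pattern.compile("\\s(alt|title)\\s*=\\s*\"([^\"]*)\"");

    // Returns "name: value" strings for every alt/title attribute found inside a tag.
    public static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            // Search only the text of the current tag, never the surrounding prose.
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                found.add(attr.group(1) + ": " + attr.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the word "alt" also occurs in the plain text after the tag.
        String news = "<p><a href=\"x\"><img alt=\"A woman mourns\" title=\"A vigil\" /></a>"
                + "alt text \"outside\" a tag</p>";
        for (String entry : extract(news)) {
            System.out.println(entry);
        }
    }
}
```

With the sample string, the quoted word in the prose after the image is ignored, whereas a pattern run over the whole text would also pick it up. Character entities such as `&#039;` in the attribute values are returned as-is and would still need decoding.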

Answer 2

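Using regular expressions to capture text from HTML is not a good idea: inline CSS and JavaScript can make the HTML syntax complicated, so you end up writing ever more complex expressions. The best approach is to use an HTML parser for Java, such as jsoup (https://jsoup.org/).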

Document doc = Jsoup.parse("<tag1><tag2>text</tag1></tag2>");    
String ownText = doc.body().ownText();
String text = doc.body().text();    
System.out.println(ownText);
System.out.println(text);

However, if you really need to use a regular expression and the HTML format is fixed, you can use this expression to capture the text between </a> and </p>:

.*<\/a>(.*)<\/p>

Try it here: https://regex101.com/r/yN8wJ9/1
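In Java, that expression could be applied roughly as follows. This is a sketch under the same fixed-format assumption. The class name `DescriptionExtractor`, the method `extract`, and the sample string are hypothetical. The `/` needs no escaping in a Java pattern, and the `Pattern.DOTALL` flag is an addition here so that `.` also matches line breaks in the description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DescriptionExtractor {
    // Greedy ".*" skips to the last </a>; group 1 is everything up to the last </p>.
    // DOTALL is an assumption added for descriptions that span several lines.
    private static final Pattern BODY =
            Pattern.compile(".*</a>(.*)</p>", Pattern.DOTALL);

    // Returns the text between </a> and </p>, or null if the markup does not fit the pattern.
    public static String extract(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the feed item from the question.
        String news = "<p><a href=\"x\"><img alt=\"A woman mourns\" /></a>"
                + "The FBI released transcripts on Monday.</p><br clear=\"all\"/>";
        System.out.println(extract(news));
    }
}
```

Returning `null` when nothing matches keeps the caller from mistaking an unexpected layout for an empty description. Note that this captures only the body text, not the `alt` or `title` values.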

How to get string data from HTML tags using a regular expression

2 answers: