How to find a compound string in text

Time: 2018-02-13 22:07:57

Tags: java regex string search text

I have been looking for a way to find strings like howareyou in a sentence and remove them from it. For example:

We have a sentence - Hello there, how are you?

Compound - how are you. So I want to end up with this string - Hello there, ? with the compound removed.

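My current solution is to split the string into words and check whether the compound contains each word, but it does not work well: any other word that happens to match part of the compound is removed too. For example: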

If we look for foreseenfuture in the string I have foreseen future for all of you, then with my solution the word for is also removed, because it is contained in the compound.
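A minimal sketch of the word-by-word approach described above, written as an illustration rather than the original code. The class name NaiveCompoundRemoval and the method removeNaively are assumptions made for this example. It drops every word that occurs anywhere inside the compound, which is why the standalone for disappears along with foreseen and future:

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveCompoundRemoval {

    // Hypothetical reconstruction: keep only the words that are NOT
    // substrings of the compound. This is the flawed approach, shown
    // only to reproduce the problem.
    static String removeNaively(String sentence, String compound) {
        List<String> kept = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            if (!compound.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // "for" is a substring of "foreseenfuture", so it is dropped as well
        System.out.println(removeNaively(
                "I have foreseen future for all of you", "foreseenfuture"));
    }
}
```

The check looks at each word in isolation, so it cannot tell whether a word is actually part of a contiguous run that spells out the compound.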

So, is there another way to solve this problem?

1 answer:

Answer 0: (score: 0)

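I will assume that compounding only removes whitespace. Under that assumption, "for, seen future. for seen future" would become "for, seen future. " because the comma breaks up the other compound. In that case, this should work: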

    String example1 = "how are you?";
    String example2 = "how, are you... here?";
    String example3 = "Madam, how are you finding the accommodations?";
    String example4 = "how are you how are you how are you taco";

    String compound = "howareyou";

    StringBuilder compoundRegexBuilder = new StringBuilder();

    // Matches a word boundary before the first letter of the compound
    compoundRegexBuilder.append("\\b");

    // inserts each character into the regex
    for(int i = 0; i < compound.length(); i++) {
        compoundRegexBuilder.append(compound.charAt(i));

        // between each letter there could be any amount of whitespace
        if(i<compound.length()-1) {
            compoundRegexBuilder.append("\\s*");
        }
    }

    // Makes sure the last word isn't part of a larger word
    compoundRegexBuilder.append("\\b");

    String compoundRegex = compoundRegexBuilder.toString();
    System.out.println(compoundRegex);
    System.out.println("Example 1:\n" + example1 + "\n" + example1.replaceAll(compoundRegex, ""));
    System.out.println("\nExample 2:\n" + example2 + "\n" + example2.replaceAll(compoundRegex, ""));
    System.out.println("\nExample 3:\n" + example3 + "\n" + example3.replaceAll(compoundRegex, ""));
    System.out.println("\nExample 4:\n" + example4 + "\n"  + example4.replaceAll(compoundRegex, ""));

The output is as follows:

\bh\s*o\s*w\s*a\s*r\s*e\s*y\s*o\s*u\b
Example 1:
how are you?
?

Example 2:
how, are you... here?
how, are you... here?

Example 3:
Madam, how are you finding the accommodations?
Madam,  finding the accommodations?

Example 4:
how are you how are you how are you taco
   taco

You can also use this to match any other alphanumeric compound.
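If you want to reuse this for several compounds, the same idea can be wrapped in a helper. The following is only a sketch under the same whitespace-only assumption. The names CompoundRemover, compoundPattern and removeCompound are made up for illustration. Each character is passed through Pattern.quote as a precaution in case the compound contains a regex metacharacter; the \b anchors still assume the compound starts and ends with a word character.

```java
import java.util.regex.Pattern;

public class CompoundRemover {

    // Builds \b c1 \s* c2 \s* ... cn \b for the characters of the compound.
    // Pattern.quote escapes each character so it is matched literally.
    static Pattern compoundPattern(String compound) {
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < compound.length(); i++) {
            if (i > 0) {
                // any amount of whitespace may separate two letters
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(String.valueOf(compound.charAt(i))));
        }
        regex.append("\\b");
        return Pattern.compile(regex.toString());
    }

    // Removes every whitespace-separated occurrence of the compound.
    static String removeCompound(String text, String compound) {
        return compoundPattern(compound).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeCompound(
                "I have foreseen future for all of you", "foreseenfuture"));
        System.out.println(removeCompound(
                "Madam, how are you finding the accommodations?", "howareyou"));
    }
}
```

Note that the surrounding spaces are left in place, so a removed compound in the middle of a sentence leaves two consecutive spaces, just as in Example 3 above.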