Question

I want to extract the noun phrases (NN, NNP, NNS, NNPS) from part-of-speech tagged text. For example:

Input sentence -
John/NNP
works/VBZ
in/IN
oil/NN
industry/NN
./.
Output: John Oil Industry

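I'm confused about the logic, because I need to search for strings such as /NN, /NNP, /NNS and /NNPS and then print the word that precedes each of them. What is the logic for extracting noun phrases in C or C++?

My own attempt is as follows: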

char* SplitString(char* str, char sep)
{
    return str;
}
main()
{
    char* input = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    char *output, *temp;
    char * field;
    char sep = '/NNP';
    int cnt = 1;
    output = SplitString(input, sep);

    field = output;
    for(temp = field; *temp; ++temp){ 
       if (*temp == sep){
          printf(" %.*s\n", temp-field, field);
          field = temp+1;
       }
    }
    printf("%.*s\n", temp-field, field);
}

My revised version is as follows:

#include <regex>
#include <iostream>

int main()
{
    const std::string s = "John/NNP works/VBZ in/IN oil/NNS industry/NNPS ./.";
    std::regex rgx("(\\w+)\/NN[P-S]{0,2}");
    std::smatch match;

    if (std::regex_search(s.begin(), s.end(), match, rgx))
        std::cout << " " << match[1] << '\n';
}

The only output I get is "John". The words with the other tags, such as /NNS, do not show up.

My second approach:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

char** str_split(char* a_str, const char a_delim)
{
    char** result = 0;
    size_t count = 0;
    char* tmp = a_str;
    char* last_comma = 0;
    char delim[2];
    delim[0] = a_delim;
    delim[1] = 0;

    /* Count how many elements will be extracted. */
    while (*tmp)
    {
        if (a_delim == *tmp)
        {
            count++;
            last_comma = tmp;
        }
        tmp++;
    }

    /* Add space for trailing token. */
    count += last_comma < (a_str + strlen(a_str) - 1);

    /* Add space for terminating null string so caller
       knows where the list of returned strings ends. */
    count++;

    result = malloc(sizeof(char*) * count);

    if (result)
    {
        size_t idx  = 0;
        char* token = strtok(a_str, delim);

        while (token)
        {
            assert(idx < count);
            *(result + idx++) = strdup(token);
            token = strtok(0, delim);
        }
        assert(idx == count - 1);
        *(result + idx) = 0;
    }

    return result;
}

int main()
{
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    char** tokens;

    //printf("INPUT SENTENCE=[%s]\n\n", text);

    tokens = str_split(text, '');

    if (tokens)
    {
        int i;
        for (i = 0; *(tokens + i); i++)
        {
            printf("[%s]\n", *(tokens + i));
            free(*(tokens + i));
        }
        printf("\n");
        free(tokens);
    }

    return 0;
}

The output is:

[John/NNP]
[works/VBZ]
[in/IN]
[oil/NN]
[industry/NN]
[./.]

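I only want the words tagged /NNP and /NN, such as John, oil and industry. How can I do this? Would a regular expression help? And how can I use the same regex in C as in C++?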

Answer 1

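If it is all about printing, then try this approach. It uses a regular expression in a search function to look for the pattern /NN[A-Z]{0,3}, that is, /NN followed by 0 to 3 uppercase letters, and captures, with (), the \w+ word in front of it.

This is untested: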

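#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::string s = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    // Capture a word, then require /NN followed by 0 to 3 uppercase letters.
    const std::regex rgx("(\\w+)/NN[A-Z]{0,3}");
    std::smatch match;

    // Resume each search where the previous match ended. Searching the whole
    // string s again would return the first match forever.
    auto start = s.cbegin();
    while (std::regex_search(start, s.cend(), match, rgx))
    {
        std::cout << "match: " << match[1] << '\n';
        start = match[0].second;
    }
}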

Answer 2

regex_token_iterator may help:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
    std::string input = "John/NNP works/VBZ in/IN oil/NN industry/NN ABC/NNPS ./.";

    // This regex has a capture group () that looks for a sequence of word characters
    // followed by /NN, which is matched but not captured. Nothing anchors the end of
    // the tag, so any tag that starts with NN (NN, NNP, NNS, NNPS) matches.
    std::regex nouns_re("(\\w+)\\/NN");

    // We pass 1 as the final argument to the token iterator
    // because we just want to print the captured word and not the /NN part
    std::copy(std::sregex_token_iterator(input.begin(), input.end(), nouns_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

Answer 3

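Your line "Add space for trailing token" is unnecessary, because strtok will automatically stop at the terminating zero.

Also, tokens = str_split(text, ''); cannot be right, because your str_split expects a character for a_delim and you pass it '', which my compiler (Clang) rejects with the error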

error: empty character constant

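Presumably you meant to split on a space ' ', but I did not test whether that works by itself. (Even though you got some form of output.)

Your code returns the result [John/NNP] (etc.) because you do nothing further to split off the tag names, and you also do not test them against your list of wanted tags. A C program only does what you tell it to do; that is what programming is about.

I propose a simple solution in plain C, using the string tokenizing function strtok, the single-character lookup strchr and the string comparison strcmp.

My routine tokenizes the input string on spaces, splitting off one word at a time (note: for this, strtok needs to be able to modify the input string), finds the slash in that token, compares the text after the slash against the list of wanted tags, and outputs the word before the slash if the tag is in the list.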

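1. After each strtok call, the pointer token points to the start of the next word, which has already been zero-terminated. So the first token will be John/NNP.
2. Then strchr tries to find the slash and, if it is found, stores its position in slash.
3. If that succeeds, slash points at the slash itself, so the tag to test is at slash+1.
4. A simple loop compares it against each tag name in the wanted list. If it is found, *slash is set to 0, overwriting the slash, so the current token string ends right before it. Then the current token is output.
5. Whether it was found or not, strtok is called again in the loop, until it fails. If it successfully finds the next token, the loop goes back to #2; otherwise it exits.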

#include <stdio.h>
#include <string.h>

int main()
{
    /* input */
    char text[] = "John/NNP works/VBZ in/IN oil/NN industry/NN ./.";
    char *wanted[] = { "NN", "NNP", "NNS", "NNPS" };

    /* helper variables */
    size_t i;
    char *token, *slash;

    token = strtok(text, " ");
    while (token)
    {
        slash = strchr (token, '/');
        if (slash && slash[1])
        {
            for (i=0; i<sizeof(wanted)/sizeof(wanted[0]); i++)
            {
                if (!strcmp (slash+1, wanted[i]))
                {
                    *slash = 0;
                    printf ("%s\n", token);
                    break;
                }
            }
        }
        token = strtok(NULL, " ");
    }

    return 0;
}

Output of this program:

John
oil
industry

I did not capitalize the words as in your desired output. That is a simple addition that you should be able to work out yourself.

Parsing noun phrases from parsed text using C/C++
