Question

I want to split a string on a specific sequence of characters, but only when those characters appear in that order.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
  int i = 0;
  char **split;
  char *tmp;

  split = malloc(20 * sizeof(char *));
  tmp  = malloc(20 * 12 * sizeof(char));
  for(i=0;i<20;i++)
  {
    split[i] = &tmp[12*i];
  }

  char *line;
  line = malloc(50 * sizeof(char));

  strcpy(line, "Test - Number -> <10.0>");
  printf("%s\n", line);
  i = 0;

  while( (split[i] = strsep(&line, " ->")) != NULL)
  {
    printf("%s\n", split[i]);
    i++;
  }
}

This prints:

Test 
Number
<10.0

But I only want to split around -> so that it gives this output:

Test - Number
<10.0>

Answer 1

I think the best way to split on an ordered sequence of delimiters is to replicate the behaviour of strtok_r using strstr, like this:

#include <stdio.h>
#include <string.h>

char *substrtok_r(char *str, const char *substrdelim, char **saveptr)
{
    char *haystack;

    if(str)
        haystack = str;
    else
        haystack = *saveptr;

    char *found = strstr(haystack, substrdelim);

    if(found == NULL)
    {
        *saveptr = haystack + strlen(haystack);
        return *haystack ? haystack : NULL;
    }

    *found = 0;
    *saveptr = found + strlen(substrdelim);

    return haystack;
}


int main(void)
{
    char line[] = "a -> b -> c -> d; Test - Number -> <10.0> ->No->split->here";

    char *input = line;
    char *token;
    char *save;

    while(token = substrtok_r(input, " ->", &save))
    {
        input = NULL;
        printf("token: '%s'\n", token);
    }

    return 0;
}

This behaves like strtok_r, but it only splits when the whole substring is found. The output is:

$ ./a 
token: 'a'
token: ' b'
token: ' c'
token: ' d; Test - Number'
token: ' <10.0>'
token: 'No->split->here'

Like strtok and strtok_r, it requires the source string to be modifiable, because it writes '\0'-terminating bytes to create and return the tokens.
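Because the function writes into its input, handing it a string literal directly would be undefined behaviour. One way around that, sketched here with a hypothetical helper named make_modifiable_copy (it is not part of the answer above), is to copy the text into writable memory first:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated, writable copy of text,
 * or NULL if the allocation fails. The caller must free() the result. */
char *make_modifiable_copy(const char *text)
{
    assert(text != NULL);

    size_t size = strlen(text) + 1;   /* +1 for the '\0'-terminating byte */
    char *copy = malloc(size);

    if (copy != NULL)
        memcpy(copy, text, size);

    return copy;
}
```

The copy can then be passed to a tokenizer that modifies its argument, and released with free() once the tokens are no longer needed.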

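EDIT

Hi, would you mind explaining why '*found = 0' means that the return value is only the string between the delimiters? I don't really understand what is going on here or why it works. Thanks

The first thing you have to understand is how strings work in C. A string is just a sequence of bytes (characters) that ends with the '\0'-terminating byte. I wrote bytes and characters in parentheses because a character in C is just a 1-byte value (on most systems a byte is 8 bits long), and the integer values that represent the characters are the ones defined in the ASCII table, which are 7-bit values. From the table you can see that the value 97 represents the character 'a', 98 represents 'b', etc. Writing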

char x = 'a';

is the same as doing

char x = 97;
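A small sketch of that equivalence follows. The function names are hypothetical, and the concrete numbers assume an ASCII system, which the C standard itself does not require:

```c
#include <assert.h>

/* Returns the integer code stored in a char. On an ASCII system
 * (an assumption), char_code('a') is 97. */
int char_code(char c)
{
    return (int)c;
}

/* Returns the character whose code is `code`: the inverse of
 * char_code for values inside the 7-bit range. */
char code_to_char(int code)
{
    assert(code >= 0 && code <= 127);   /* stay inside the 7-bit range */
    return (char)code;
}
```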

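The value 0 is special for strings. It is called NUL (the null character) or the '\0'-terminating byte, and it tells functions where a string ends. A function like strlen returns the length of a string by counting the bytes it encounters until it reaches a byte with the value 0.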
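To make the counting concrete, here is an illustrative re-implementation under the hypothetical name count_until_nul. It is only a sketch of the idea; real code should keep calling strlen:

```c
#include <assert.h>
#include <stddef.h>

/* Counts bytes until the first byte with the value 0, which is what
 * strlen does conceptually. */
size_t count_until_nul(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```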

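That is why char arrays are used to store strings: a pointer to the array gives the start of the memory block where the sequence of chars is stored.

Let's look at this: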

char string[] = { 'H', 'e', 'l', 'l', 'o', 0, 48, 49, 50, 0 };

The memory layout of this array is

   0     1     2     3     4    5     6     7     8    9
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+
| 'H' | 'e' | 'l' | 'l' | 'o' | \0 | '0' | '1' | '2' | \0 |
+-----+-----+-----+-----+-----+----+-----+-----+-----+----+

or, more precisely, with the integer values

  0     1     2     3     4    5   6    7    8    9
+----+-----+-----+-----+-----+---+----+----+----+---+
| 72 | 101 | 108 | 108 | 111 | 0 | 48 | 49 | 50 | 0 |
+----+-----+-----+-----+-----+---+----+----+----+---+

Note that the value 0 represents '\0', 48 represents '0', 49 represents '1' and 50 represents '2'. If you do

printf("%zu\n", strlen(string));

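the output is 5. strlen finds the value 0 at position 5 and stops counting. string, however, stores two strings: from position 6 onward a new sequence of characters begins, and it also ends with 0, which makes it a second valid string in the same array. To access it, you need a pointer that points past the first 0 value.

printf("1. %s\n", string);
printf("2. %s\n", string + strlen(string) + 1);

The output is

1. Hello
2. 012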
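The same pointer step can be repeated to walk over every string packed into one array. The sketch below uses a hypothetical function, count_packed_strings, and assumes the caller knows the total size of the buffer and that its last byte is 0:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the '\0'-terminated strings stored back to back in buf.
 * Assumes buf[size - 1] is 0, so every string is terminated. */
size_t count_packed_strings(const char *buf, size_t size)
{
    assert(buf != NULL && size > 0 && buf[size - 1] == '\0');

    size_t count = 0;
    size_t pos = 0;
    while (pos < size) {
        pos += strlen(buf + pos) + 1;   /* skip the string and its terminator */
        count++;
    }
    return count;
}
```

For the ten-byte array from the example, the loop visits position 0 and position 6 and then stops at the end of the buffer.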

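This property is used by functions like strtok (and mine above) to return substrings of a larger string without having to create a copy (which would mean creating a new array or dynamically allocating memory, and then using strcpy to make the copy).

Suppose you have this string: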

char line[] = "This is a sentence;This is another one";

Here there is only one string, because the '\0'-terminating byte comes after the last 'e' in the string. If I do, however:

line[18] = 0; // same as line[18] = '\0';

then I have created two strings in the same array:

"This is a sentence\0This is another one"

because I replaced the semicolon ';' with '\0', thus creating one string from position 0 to 18 and a second one from position 19 to 38. If I now do

printf("string: %s\n", line);

the output will be

string: This is a sentence
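The assignment to line[18] can be wrapped in a small helper that also hands back the second string. This is only a sketch; split_at is a hypothetical name, and it assumes the array is modifiable and the index lies inside the string:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrites s[index] with '\0' and returns a pointer to the text
 * that follows it. */
char *split_at(char *s, size_t index)
{
    assert(s != NULL && index < strlen(s));

    s[index] = '\0';        /* the first string now ends here */
    return s + index + 1;   /* the second string starts one byte later */
}
```

Calling split_at(line, 18) on the sentence above leaves "This is a sentence" in line and returns a pointer to "This is another one". No bytes are copied.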

Now let's look at the function itself:

char *substrtok_r(char *str, const char *substrdelim, char **saveptr);

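The first argument is the source string, the second argument is the delimiter string, and the third is a double pointer to char: you have to pass a pointer to a pointer to char. It is used to remember where the function should resume scanning on the next call; more on that later.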
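Why a pointer to a pointer? The function has to change a pointer that belongs to the caller, and in C that is only possible through the pointer's address. A minimal sketch with a hypothetical function, next_char, that keeps its position in a caller-owned cursor:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the next character of the text and advances the caller's
 * cursor through the char ** parameter. At the end of the text it
 * returns '\0' without advancing. */
char next_char(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);

    char c = **cursor;
    if (c != '\0')
        (*cursor)++;    /* this change is visible to the caller */
    return c;
}
```

Each call continues where the previous one stopped, because the position lives in the caller's variable and not inside the function.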

This is the algorithm:

if str is not NULL:
    start a new scan sequence from str
otherwise
    resume scanning from string pointed to by *saveptr

find position of substring_d pointed to by 'substrdelim'

if no such substring_d is found
    if the current character of the scanned text is \0
        no more substrings to return --> return NULL
    otherwise
        return the scanned text and set *saveptr to
        point to the \0 character of the scanned text,
        so that the next iteration ends the scanning
        by returning NULL

otherwise (a substring_d was found)
    create a new substring_a until the found one
    by setting the first character of the found
    substring_d to 0.

    update *saveptr to the start of the found substring_d
    plus its previous length so that *saveptr points
    past the delimiter sequence found in substring_d.

    return new created substring_a

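The first part is easy to understand:

if(str)
    haystack = str;
else
    haystack = *saveptr;

If str is not NULL, you want to start a new scan sequence. That is why, in main, the input pointer is set to point to the beginning of the string stored in line. Every subsequent call must be made with str == NULL, which is why the first thing done in the while loop is to set input = NULL; so that substrtok_r resumes scanning using *saveptr. This is the standard behaviour of strtok.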
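For comparison, the same calling pattern with the standard strtok is sketched below; count_tokens is a hypothetical name. Note the difference that prompted the question: strtok treats its second argument as a set of single characters, not as one ordered sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts tokens with the standard strtok. delims is a SET of
 * characters, so " ->" splits on ' ', '-' or '>' individually. */
size_t count_tokens(char *text, const char *delims)
{
    assert(text != NULL && delims != NULL);

    size_t count = 0;
    /* The first call passes the string, later calls pass NULL to resume. */
    for (char *tok = strtok(text, delims); tok != NULL; tok = strtok(NULL, delims))
        count++;
    return count;
}
```

With a modifiable copy of "Test - Number -> <10.0>" and the delimiters " ->", this finds the three pieces Test, Number and <10.0, not the two pieces the question asks for.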

The next step is to look for the delimiting substring:

char *found = strstr(haystack, substrdelim);

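The next part handles the case where no delimiting substring is found²:

if(found == NULL)
{
    *saveptr = haystack + strlen(haystack);
    return *haystack ? haystack : NULL;
}

*saveptr is updated to point past the whole source, so that it points to the '\0'-terminating byte. The return line can be rewritten as

if(*haystack == '\0')
    return NULL;
else
    return haystack;

which says that if the source is already an empty string¹, NULL is returned. This means that no more substrings can be found and the calls to the function end. This is also the standard behaviour of strtok.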

The last part

*found = 0;
*saveptr = found + strlen(substrdelim);

return haystack;

handles the case where the delimiting substring is found. Here

*found = 0;

is basically doing

found[0] = '\0';

which creates substrings as explained above. To make it clear once again, before

*found = 0;
*saveptr = found + strlen(substrdelim);

return haystack;

the memory looks like this:

+-----+-----+-----+-----+-----+-----+
| 'a' | ' ' | '-' | '>' | ' ' | 'b' | ...
+-----+-----+-----+-----+-----+-----+
   ^     ^
   |     |
   |     found
   haystack
   *saveptr

After

*found = 0;
*saveptr = found + strlen(substrdelim);

the memory looks like this:

+-----+------+-----+-----+-----+-----+
| 'a' | '\0' | '-' | '>' | ' ' | 'b' | ...
+-----+------+-----+-----+-----+-----+
   ^     ^                  ^
   |     |                  |
   |     found              *saveptr
   haystack                 because strlen(substrdelim) is 3

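Remember: if I do printf("%s\n", haystack); at this point, it will print a, because the character that found points to has been set to 0. *found = 0 created two strings out of one, as explained above. strtok (and my function, which is based on strtok) uses the same technique. So when the function executes

return haystack;

the first string in token will be the token before the split. Eventually substrtok_r returns NULL and the loop exits, because substrtok_r returns NULL when no more splits can be created, just like strtok.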
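Putting the pieces together, the loop can also store the tokens in an array, which is what the question's split array was meant for. The sketch below repeats substrtok_r so that it compiles on its own; split_into is a hypothetical wrapper, not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same logic as the substrtok_r shown earlier, repeated here so the
 * sketch is self-contained. */
static char *substrtok_r(char *str, const char *substrdelim, char **saveptr)
{
    char *haystack = str ? str : *saveptr;
    char *found = strstr(haystack, substrdelim);

    if (found == NULL) {
        *saveptr = haystack + strlen(haystack);
        return *haystack ? haystack : NULL;
    }

    *found = '\0';
    *saveptr = found + strlen(substrdelim);
    return haystack;
}

/* Stores up to max token pointers in out and returns how many were
 * stored. The pointers refer into text, which is modified in place. */
size_t split_into(char *text, const char *delim, char **out, size_t max)
{
    assert(text != NULL && delim != NULL && out != NULL);

    size_t n = 0;
    char *save;
    char *token = substrtok_r(text, delim, &save);

    while (token != NULL && n < max) {
        out[n++] = token;
        token = substrtok_r(NULL, delim, &save);
    }
    return n;
}
```

With the question's input and the delimiter " ->", the second token keeps its leading space (" <10.0>"). Passing " -> " with the trailing space as the delimiter yields "Test - Number" and "<10.0>".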
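Footnotes

¹An empty string is a string whose first character is already the '\0'-terminating byte.

²This is a very important part. Most of the functions in the C standard library, like strstr, do not return a new string in memory: they do not create a copy and return the copy (unless the documentation says so). They return a pointer into the original plus an offset.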

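On success strstr returns a pointer to the start of the substring; this pointer is at an offset from the source.

const char *txt = "abcdef";
char *p = strstr(txt, "cd");

Here strstr returns a pointer to the start of the substring "cd" in "abcdef". To get the offset, compute p - txt, which returns how many bytes apart they are.

b = base address where txt is pointing to

b     b+1   b+2   b+3   b+4   b+5   b+6
+-----+-----+-----+-----+-----+-----+------+
| 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | '\0' |
+-----+-----+-----+-----+-----+-----+------+
^           ^
|           |
txt         p

So txt points to address b and p points to address b+2. That is why you get the offset by computing p - txt, which is (b+2) - b => 2. So p points to the original address plus an offset of 2 bytes. Because of this behaviour, things like *found = 0; work in the first place.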
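The offset computation can be sketched as a small function. The name find_offset is hypothetical, and returning -1 for "not found" is a choice made for this example, not something strstr does:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of needle inside haystack in bytes, or -1 if
 * needle does not occur. */
ptrdiff_t find_offset(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    const char *p = strstr(haystack, needle);
    if (p == NULL)
        return -1;

    return p - haystack;    /* pointer difference, counted in chars */
}
```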

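Note that doing things like txt + 2 returns a new pointer that points to where txt points plus an offset of 2. This is called pointer arithmetic. It is like regular arithmetic, but here the compiler takes the size of the object into account. char is a type defined to have size 1, hence sizeof(char) returns 1. But suppose you have an array of integers:

int arr[] = { 7, 2, 1, 5 };

On my system an int has a size of 4, so an int object needs 4 bytes in memory. This array looks like this in memory:

b = base address where arr is stored

address       base        base + 4    base + 8    base + 12
in bytes      +-----------+-----------+-----------+-----------+
              |     7     |     2     |     1     |     5     |
              +-----------+-----------+-----------+-----------+
pointer       arr         arr + 1     arr + 2     arr + 3
arithmetic

Here arr + 1 returns a pointer to where arr is stored plus an offset of 4 bytes.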
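The difference between counting in elements and counting in bytes can be checked directly. The two function names below are hypothetical, and the sketch compares against sizeof(int) so that it does not assume that an int is 4 bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Distance between two pointers into the same int array, measured in
 * elements. */
ptrdiff_t element_distance(const int *from, const int *to)
{
    assert(from != NULL && to != NULL);
    return to - from;
}

/* The same distance measured in bytes, obtained by viewing the
 * addresses as pointers to char (whose size is 1 by definition). */
ptrdiff_t byte_distance(const int *from, const int *to)
{
    assert(from != NULL && to != NULL);
    return (const char *)to - (const char *)from;
}
```

For the array above, element_distance(arr, arr + 1) is 1, while byte_distance(arr, arr + 1) equals sizeof(int), which is 4 on the system described.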

Split string based on consecutive delimiters
