How can I quickly count how often a string occurs in a given part of a string list?

Asked: 2015-09-25 09:46:13

Tags: c# list optimization

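I have a list of strings, and I need to count the number of list entries that contain a specific string (and this only needs to be done for a subset of the list, not the whole list).

The code below works fine, but its performance is unfortunately not at an acceptable level, because I need to parse lists of 500k to 900k entries. For these entries I need to run the code below about 10k times (since I need to analyze 10k parts of the list). That takes 177 seconds or even more. So my question is: how can I do this faster?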

private int ExtraktNumbers(List<string> myList, int start, int end)
{
    return myList.Where((x, index) => index >= start && index <= end 
                        && x.Contains("MYNUMBER:")).Count();
}

4 Answers:

Answer 0 (score: 3)

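Now that we know you are calling the method 10,000 times, here is my suggestion. I assume that since you have hard-coded "MYNUMBER:", you are using a different range with each call? If that is the case...

First, run an 'index' method once and build a list of the indexes that match. Then you can easily count the matches for whatever range you need.

NOTE: This is something quick, and you could optimize it even further: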

List<int> matchIndex = new List<int>();

void RunIndex(List<string> myList)
{
    for(int i = 0; i < myList.Count; i++)
    {
        if(myList[i].Contains("MYNUMBER:"))
        {
            matchIndex.Add(i);
        }
    }
}

int CountForRange(int start, int end)
{
    return matchIndex.Count(x => x >= start && x <= end);
}

Then you can use it like this, for example:

RunIndex(myList);

// I don't know what code you have here, this is just basic example.
for(int i = 0; i < 10000; i++)
{
    int count = CountForRange(startOfRange, endOfRange);
    // Do something with count.
}

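Also, if the ranges you check contain a lot of repeats, you might consider caching the range counts in a dictionary, but at this stage it is hard to tell whether that would be worth doing.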
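One way the "optimize it even further" remark could be taken is a prefix-count array: after a single pass over the list, each range query becomes a subtraction of two array elements instead of a scan over `matchIndex`. The following is only a sketch under that assumption. The names `RangeCounter` and `BuildPrefixCounts` are hypothetical and not part of the original answer, and the bounds are treated as inclusive to match the code in the question.

```csharp
using System;
using System.Collections.Generic;

static class RangeCounter
{
    // prefix[i] holds the number of matching entries among myList[0 .. i-1],
    // so prefix[0] is 0 and prefix[myList.Count] is the total match count.
    public static int[] BuildPrefixCounts(List<string> myList, string needle)
    {
        var prefix = new int[myList.Count + 1];
        for (int i = 0; i < myList.Count; i++)
        {
            bool isMatch = myList[i].Contains(needle);
            prefix[i + 1] = prefix[i] + (isMatch ? 1 : 0);
        }
        return prefix;
    }

    // Counts matches at indexes start..end, both inclusive.
    // Out-of-range bounds are clamped, so they simply contribute no matches.
    public static int CountForRange(int[] prefix, int start, int end)
    {
        start = Math.Max(start, 0);
        end = Math.Min(end, prefix.Length - 2);
        if (start > end)
        {
            return 0;
        }
        return prefix[end + 1] - prefix[start];
    }
}
```

With this layout the expensive `Contains` call runs once per list entry, and each of the roughly 10k range queries costs constant time, at the price of one extra `int` per entry.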

Answer 1 (score: 2)

I am pretty sure that a simple iterative solution will perform better:

private int ExtractNumbers(List<string> myList, int start, int end)
{
    int count = 0;

    for (int i = start; i <= end; i++)
    {
        if (myList[i].Contains("MYNUMBER:"))
        {
            count++;
        }
    }

    return count;
}

Answer 2 (score: 1)

My tests on 10 million lines (roughly 10 times more than in the question):

  var data = Enumerable
   .Range(1, 10000000)
   .Select(item => "123456789 bla-bla-bla " + "MYNUMBER:" + item.ToString())
   .ToList();

  Stopwatch sw = new Stopwatch();

  sw.Start();

  int result = ExtraktNumbers(data, 0, 10000000);

  sw.Stop();

I got these results:

2.78 seconds: your initial implementation

Naive loop (2.60 seconds):

private int ExtraktNumbers(List<string> myList, int start, int end) {
  int result = 0;

  for (int i = start; i < end; ++i)
    if (myList[i].Contains("MYNUMBER:"))
      result += 1;

  return result;
}

PLinq (1.72 seconds):

   private int ExtraktNumbers(List<string> myList, int start, int end) {
      return myList
        .Skip(start)          // slice [start, end) first, while the sequence is still ordered
        .Take(end - start)
        .AsParallel() // <- Do it in parallel
        .Where(x => x.Contains("MYNUMBER:"))
        .Count();
    }

Explicit parallel implementation (1.66 seconds):

   private int ExtraktNumbers(List<string> myList, int start, int end) {
     long result = 0;

     Parallel.For(start, end, (i) => {
       if (myList[i].Contains("MYNUMBER:"))
         Interlocked.Increment(ref result);
     });

     return (int) result;
  }

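I can't reproduce the 177 seconds.

Answer 3 (score: 0)

If you know from the start which intervals you want to consider, looping over the list once is probably a good idea, as Dmytro and musefan mentioned above, so I won't repeat the same idea.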

But I have a different suggestion for a performance improvement. How do you create your list? Do you know the number of items in advance? Because for a list this big, you can use the constructor that takes the initial capacity to improve performance.
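As a rough sketch of that suggestion, the list can be created with `new List<string>(capacity)` so that its internal array does not have to be reallocated and copied repeatedly while 500k to 900k entries are added. The helper name `BuildList` and the `expectedCount` parameter are hypothetical; how the entries are actually produced is not shown in the question.

```csharp
using System.Collections.Generic;

static class ListBuilder
{
    // expectedCount is an assumed estimate of the final number of entries;
    // the list still grows automatically if the estimate turns out too low.
    public static List<string> BuildList(IEnumerable<string> lines, int expectedCount)
    {
        // Reserve the backing array once instead of letting it grow step by step.
        var list = new List<string>(expectedCount);
        foreach (string line in lines)
        {
            list.Add(line);
        }
        return list;
    }
}
```

Note that this only affects how quickly the list is built. It does not change the cost of the counting itself.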