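Given a string s, what is the fastest way to generate the set of all its unique substrings?
Example: for str = "aba", we would get substrs = {"a", "b", "ab", "ba", "aba"}.
The naive algorithm would traverse the entire string, generating substrings of length 1..n in each iteration, which yields an O(n^2) upper bound.
Is a better bound possible?
(This is technically homework, so pointers are welcome as well.)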
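Answer 0 (score: 41)
As other posters have said, a given string can have O(n^2) substrings, so printing them all cannot be done faster than that. There is, however, an efficient representation of the set that can be built in linear time: the suffix tree.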
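As a rough illustration of such a compact representation, here is a minimal Python sketch. Note that it swaps in a suffix automaton, a structure related to the suffix tree, because it is shorter to write; it is not the suffix tree the answer names. The function name `count_distinct_substrings` is made up for this example. The sketch counts the distinct substrings without listing them.

```python
def count_distinct_substrings(s):
    """Count distinct non-empty substrings of s using a suffix automaton."""
    # Per-state data: outgoing transitions, suffix link, longest string length.
    nxt = [{}]
    link = [-1]
    length = [0]
    last = 0  # state that represents the whole prefix read so far
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter longest length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # Each non-root state v stands for length[v] - length[link[v]] substrings.
    return sum(length[v] - length[link[v]] for v in range(1, len(nxt)))
```

For the question's example, `count_distinct_substrings("aba")` gives 5, which matches the five substrings listed there.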
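Answer 1 (score: 13)
There is no way to do this faster than O(n^2), because a string has O(n^2) substrings in total. If you have to generate all of them, their number will be n(n + 1) / 2 in the worst case, hence the upper bound of O(n^2) and the lower bound of Ω(n^2).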
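The count n(n + 1) / 2 can be derived by grouping substrings by length: a string of length n has n - k + 1 substrings of length k, counted by start position. A short derivation:

```latex
\sum_{k=1}^{n} (n - k + 1) \;=\; \sum_{j=1}^{n} j \;=\; \frac{n(n+1)}{2} \;=\; \Theta(n^2)
```

The maximum is reached when all characters of the string are distinct, since then no two start positions produce the same substring.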
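Answer 2 (score: 6)
The first is brute force, with complexity O(N^3), which can be brought down to O(N^2 log(N)). The second uses a HashSet and has complexity O(N^2). The third uses LCP: it first finds all suffixes of the given string, with a worst case of O(N^2) and a best case of O(N log(N)).
First solution: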
import java.util.Scanner;
public class DistinctSubString {
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
System.out.print("Enter The string");
String s = in.nextLine();
long startTime = System.currentTimeMillis();
int L = s.length();
int N = L * (L + 1) / 2;
String[] Comb = new String[N];
for (int i = 0, p = 0; i < L; ++i) {
for (int j = 0; j < (L - i); ++j) {
Comb[p++] = s.substring(j, i + j + 1);
}
}
/*
* for(int j=0;j<N;++j) { System.out.println(Comb[j]); }
*/
boolean[] val = new boolean[N];
for (int i = 0; i < N; ++i)
val[i] = true;
int counter = N;
int p = 0, start = 0;
for (int i = 0, j; i < L; ++i) {
p = L - i;
for (j = start; j < (start + p); ++j) {
if (val[j]) {
//System.out.println(Comb[j]);
for (int k = j + 1; k < start + p; ++k) {
if (Comb[j].equals(Comb[k])) {
counter--;
val[k] = false;
}
}
}
}
start = j;
}
System.out.println("Substrings are " + N
+ " of which unique substrings are " + counter);
long endTime = System.currentTimeMillis();
System.out.println("It took " + (endTime - startTime) + " milliseconds");
}
}
Second solution:
import java.util.*;
public class DistictSubstrings_usingHashTable {
public static void main(String args[]) {
// create a hash set
Scanner in = new Scanner(System.in);
System.out.print("Enter The string");
String s = in.nextLine();
int L = s.length();
long startTime = System.currentTimeMillis();
Set<String> hs = new HashSet<String>();
// add elements to the hash set
for (int i = 0; i < L; ++i) {
for (int j = 0; j < (L - i); ++j) {
hs.add(s.substring(j, i + j + 1));
}
}
System.out.println(hs.size());
long endTime = System.currentTimeMillis();
System.out.println("It took " + (endTime - startTime) + " milliseconds");
}
}
Third solution:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
public class LCPsolnFroDistinctSubString {
public static void main(String[] args) throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Enter Desired String ");
String string = br.readLine();
int length = string.length();
String[] arrayString = new String[length];
for (int i = 0; i < length; ++i) {
arrayString[i] = string.substring(length - 1 - i, length);
}
Arrays.sort(arrayString);
for (int i = 0; i < length; ++i)
System.out.println(arrayString[i]);
long num_substring = arrayString[0].length();
for (int i = 0; i < length - 1; ++i) {
int j = 0;
for (; j < arrayString[i].length(); ++j) {
if (!((arrayString[i].substring(0, j + 1)).equals((arrayString)[i + 1]
.substring(0, j + 1)))) {
break;
}
}
num_substring += arrayString[i + 1].length() - j;
}
System.out.println("unique substrings = " + num_substring);
}
}
Fourth solution:
public static void printAllCombinations(String soFar, String rest) {
if(rest.isEmpty()) {
System.out.println(soFar);
} else {
printAllCombinations(soFar + rest.substring(0,1), rest.substring(1));
printAllCombinations(soFar , rest.substring(1));
}
}
Test case:- printAllCombinations("", "abcd");
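Answer 3 (score: 3)
As for big O... the best you can do is O(n^2).
There is no need to reinvent the wheel. It is based on sets rather than strings, so you will have to take the concepts and apply them to your own situation.
Algorithms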
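Answer 4 (score: 2)
Well, since there are potentially n*(n+1)/2 different substrings (+1 for the empty substring), I doubt you can do better than O(n^2) in the worst case. The easiest approach is to generate them and use some nice O(1) lookup table (such as a hashmap) to exclude duplicates as you find them.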
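The generate-and-deduplicate idea described above can be sketched in a few lines of Python. The function name `unique_substrings` is chosen for this example only, and the built-in `set` plays the role of the hash-based lookup table.

```python
def unique_substrings(s):
    """Return the set of all distinct non-empty substrings of s."""
    seen = set()  # hash-based container: duplicates are dropped on insert
    n = len(s)
    for i in range(n):                 # start index
        for j in range(i + 1, n + 1):  # end index (exclusive)
            seen.add(s[i:j])
    return seen


if __name__ == "__main__":
    print(sorted(unique_substrings("aba")))
```

There are O(n^2) insertions, but each slice and each hash computation also costs time proportional to the substring length, so the total work is larger than the number of insertions alone suggests.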
Answer 5 (score: 1)
class program
{
List<String> lst = new List<String>();
String str = "abc";
public void func()
{
subset(0, "");
lst.Sort();
lst = lst.Distinct().ToList();
foreach (String item in lst)
{
Console.WriteLine(item);
}
}
void subset(int n, String s)
{
for (int i = n; i < str.Length; i++)
{
lst.Add(s + str[i].ToString());
subset(i + 1, s + str[i].ToString());
}
}
}
Answer 6 (score: 1)
WebBrowser t = c as WebBrowser;
if(t == null)
continue;
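Answer 7 (score: 0)
This can only be done in O(n^2) time, because the total number of unique substrings of a string can be as large as n(n + 1)/2.
Example:
string s = "abcd"
Pass 0: (all strings of length 1)
a, b, c, d = 4 strings
Pass 1: (all strings of length 2)
ab, bc, cd = 3 strings
Pass 2: (all strings of length 3)
abc, bcd = 2 strings
Pass 3: (all strings of length 4)
abcd = 1 string
Using this analogy, we can write a solution with O(n^2) time complexity and constant space complexity.
The source code is as follows: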
#include<stdio.h>
void print(char arr[], int start, int end)
{
int i;
for(i=start;i<=end;i++)
{
printf("%c",arr[i]);
}
printf("\n");
}
void substrings(char arr[], int n)
{
int pass,j,start,end;
int no_of_strings = n-1;
for(pass=0;pass<n;pass++)
{
start = 0;
end = start+pass;
for(j=no_of_strings;j>=0;j--)
{
print(arr,start, end);
start++;
end = start+pass;
}
no_of_strings--;
}
}
int main()
{
char str[] = "abcd";
substrings(str,4);
return 0;
}
Answer 8 (score: 0)
Here is my code in Python. It generates all possible substrings of any given string.
def find_substring(str_in):
    substrs = []
    if len(str_in) <= 1:
        return [str_in]
    s1 = find_substring(str_in[:1])   # results for the first character
    s2 = find_substring(str_in[1:])   # results for the rest of the string
    # s1 and s2 themselves are not appended: they are unhashable
    # collections and would make set(substrs) raise a TypeError.
    for s11 in s1:
        substrs.append(s11)
        for s21 in s2:
            substrs.append("%s%s" % (s11, s21))
    for s21 in s2:
        substrs.append(s21)
    return set(substrs)
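If you pass str_ = "abcdef" to the function, it generates the following result. Note that the result also contains non-contiguous subsequences such as acf, not only contiguous substrings:
a, ab, abc, abcd, abcde, abcdef, abcdf, abce, abcef, abcf, abd, abde, abdef, abdf, abe, abef, abf, ac, acd, acde, acdef, acdf, ace, acef, acf, ad, ade, adef, adf, ae, aef, af, b, bc, bcd, bcde, bcdef, bcdf, bce, bcef, bcf, bd, bde, bdef, bdf, be, bef, bf, c, cd, cde, cdef, cdf, ce, cef, cf, d, de, def, df, e, ef, f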
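Answer 9 (score: 0)
The naive algorithm takes O(n^3) time, not O(n^2) time. There are O(n^2) substrings. If you put those O(n^2) substrings into, for example, a set, the set performs O(lg n) comparisons per string to check whether it is already present. In addition, each string comparison takes O(n) time. Therefore it takes O(n^3 lg n) time if you use a set, and you can reduce that to O(n^3) time if you use a hash table instead of a set.
The point is that these are string comparisons, not number comparisons.
So one of the best approaches is to use a suffix array together with the longest common prefix (LCP) algorithm, which reduces the time for this problem to O(n^2). Build the suffix array with an O(n)-time algorithm. The time for one LCP is O(n). Since an LCP is computed for each pair of adjacent strings in the suffix array, the total time to find the lengths of the distinct substrings is O(n^2).
Besides, if you want to print all the distinct substrings, that takes O(n^2) time.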
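The suffix array plus LCP idea can be illustrated with a small Python sketch. It is deliberately simplified: it sorts the suffix strings directly instead of using an O(n) suffix array construction, and it computes each LCP by plain character comparison. The function name `count_distinct_by_lcp` is an assumption made for this example. Each suffix contributes its length minus the prefix it shares with the previous suffix in sorted order.

```python
def count_distinct_by_lcp(s):
    """Count distinct substrings as sum(len(suffix) - LCP with previous suffix)."""
    n = len(s)
    # Simplification: sort the suffix strings themselves (not linear time).
    suffixes = sorted(s[i:] for i in range(n))
    total = 0
    prev = ""
    for suf in suffixes:
        # Longest common prefix of this suffix and its sorted predecessor.
        lcp = 0
        while lcp < len(prev) and lcp < len(suf) and prev[lcp] == suf[lcp]:
            lcp += 1
        # Prefixes of suf longer than lcp have not been counted before.
        total += len(suf) - lcp
        prev = suf
    return total


if __name__ == "__main__":
    print(count_distinct_by_lcp("banana"))
```

For "aba" the sorted suffixes are "a", "aba" and "ba", contributing 1, 2 and 2 new substrings, for a total of 5.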
Answer 10 (score: 0)
This prints the unique substrings. https://ideone.com/QVWOh0
def uniq_substring(test):
    lista = []
    for i in range(len(test)):
        for k in range(len(test) - i):
            sub = test[i:i+k+1]
            # Skip a substring if it, or its reverse, was already collected.
            if sub not in lista and sub[::-1] not in lista:
                lista.append(sub)
    print(lista)
uniq_substring('rohit')
uniq_substring('abab')
['r', 'ro', 'roh', 'rohi', 'rohit', 'o', 'oh', 'ohi', 'ohit', 'h',
'hi', 'hit', 'i', 'it', 't']
['a', 'ab', 'aba', 'abab', 'b', 'bab']
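Answer 11 (score: 0)
Try this code, which uses a suffix array and the longest common prefix. It can also give you the total number of unique substrings. The code may cause a stack overflow in Visual Studio but runs fine in Eclipse C++. That is because it returns vectors from functions. I haven't tested it against extremely long strings. I will do so and report back.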
// C++ program for building LCP array for given text
#include <bits/stdc++.h>
#include <vector>
#include <string>
using namespace std;
#define MAX 100000
int cum[MAX];
// Structure to store information of a suffix
struct suffix
{
int index; // To store original index
int rank[2]; // To store ranks and next rank pair
};
// A comparison function used by sort() to compare two suffixes
// Compares two pairs, returns 1 if first pair is smaller
int cmp(struct suffix a, struct suffix b)
{
return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):
(a.rank[0] < b.rank[0] ?1: 0);
}
// This is the main function that takes a string 'txt' of size n as an
// argument, builds and return the suffix array for the given string
vector<int> buildSuffixArray(string txt, int n)
{
// A structure to store suffixes and their indexes
struct suffix suffixes[n];
// Store suffixes and their indexes in an array of structures.
// The structure is needed to sort the suffixes alphabatically
// and maintain their old indexes while sorting
for (int i = 0; i < n; i++)
{
suffixes[i].index = i;
suffixes[i].rank[0] = txt[i] - 'a';
suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1;
}
// Sort the suffixes using the comparison function
// defined above.
sort(suffixes, suffixes+n, cmp);
// At his point, all suffixes are sorted according to first
// 2 characters. Let us sort suffixes according to first 4
// characters, then first 8 and so on
int ind[n]; // This array is needed to get the index in suffixes[]
// from original index. This mapping is needed to get
// next suffix.
for (int k = 4; k < 2*n; k = k*2)
{
// Assigning rank and index values to first suffix
int rank = 0;
int prev_rank = suffixes[0].rank[0];
suffixes[0].rank[0] = rank;
ind[suffixes[0].index] = 0;
// Assigning rank to suffixes
for (int i = 1; i < n; i++)
{
// If first rank and next ranks are same as that of previous
// suffix in array, assign the same new rank to this suffix
if (suffixes[i].rank[0] == prev_rank &&
suffixes[i].rank[1] == suffixes[i-1].rank[1])
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = rank;
}
else // Otherwise increment rank and assign
{
prev_rank = suffixes[i].rank[0];
suffixes[i].rank[0] = ++rank;
}
ind[suffixes[i].index] = i;
}
// Assign next rank to every suffix
for (int i = 0; i < n; i++)
{
int nextindex = suffixes[i].index + k/2;
suffixes[i].rank[1] = (nextindex < n)?
suffixes[ind[nextindex]].rank[0]: -1;
}
// Sort the suffixes according to first k characters
sort(suffixes, suffixes+n, cmp);
}
// Store indexes of all sorted suffixes in the suffix array
vector<int>suffixArr;
for (int i = 0; i < n; i++)
suffixArr.push_back(suffixes[i].index);
// Return the suffix array
return suffixArr;
}
/* To construct and return LCP */
vector<int> kasai(string txt, vector<int> suffixArr)
{
int n = suffixArr.size();
// To store LCP array
vector<int> lcp(n, 0);
// An auxiliary array to store inverse of suffix array
// elements. For example if suffixArr[0] is 5, the
// invSuff[5] would store 0. This is used to get next
// suffix string from suffix array.
vector<int> invSuff(n, 0);
// Fill values in invSuff[]
for (int i=0; i < n; i++)
invSuff[suffixArr[i]] = i;
// Initialize length of previous LCP
int k = 0;
// Process all suffixes one by one starting from
// first suffix in txt[]
for (int i=0; i<n; i++)
{
/* If the current suffix is at n-1, then we don’t
have next substring to consider. So lcp is not
defined for this substring, we put zero. */
if (invSuff[i] == n-1)
{
k = 0;
continue;
}
/* j contains index of the next substring to
be considered to compare with the present
substring, i.e., next string in suffix array */
int j = suffixArr[invSuff[i]+1];
// Directly start matching from k'th index as
// at-least k-1 characters will match
while (i+k<n && j+k<n && txt[i+k]==txt[j+k])
k++;
lcp[invSuff[i]] = k; // lcp for the present suffix.
// Deleting the starting character from the string.
if (k>0)
k--;
}
// return the constructed lcp array
return lcp;
}
// Utility function to print an array
void printArr(vector<int>arr, int n)
{
for (int i = 0; i < n; i++)
cout << arr[i] << " ";
cout << endl;
}
// Driver program
int main()
{
int t;
cin >> t;
//t = 1;
while (t > 0) {
//string str = "banana";
string str;
cin >> str; // >> k;
vector<int>suffixArr = buildSuffixArray(str, str.length());
int n = suffixArr.size();
cout << "Suffix Array : \n";
printArr(suffixArr, n);
vector<int>lcp = kasai(str, suffixArr);
cout << "\nLCP Array : \n";
printArr(lcp, n);
// cum will hold number of substrings if that'a what you want (total = cum[n-1]
cum[0] = n - suffixArr[0];
// vector <pair<int,int>> substrs[n];
int count = 1;
for (int i = 1; i <= n-suffixArr[0]; i++) {
//substrs[0].push_back({suffixArr[0],i});
string sub_str = str.substr(suffixArr[0],i);
cout << count << " " << sub_str << endl;
count++;
}
for(int i = 1;i < n;i++) {
cum[i] = cum[i-1] + (n - suffixArr[i] - lcp[i - 1]);
int end = n - suffixArr[i];
int begin = lcp[i-1] + 1;
int begin_suffix = suffixArr[i];
for (int j = begin, k = 1; j <= end; j++, k++) {
//substrs[i].push_back({begin_suffix, lcp[i-1] + k});
// cout << "i push " << i << " " << begin_suffix << " " << k << endl;
string sub_str = str.substr(begin_suffix, lcp[i-1] +k);
cout << count << " " << sub_str << endl;
count++;
}
}
/*int count = 1;
cout << endl;
for(int i = 0; i < n; i++){
for (auto it = substrs[i].begin(); it != substrs[i].end(); ++it ) {
string sub_str = str.substr(it->first, it->second);
cout << count << " " << sub_str << endl;
count++;
}
}*/
t--;
}
return 0;
}
Here is a simpler algorithm:
#include <iostream>
#include <string.h>
#include <vector>
#include <string>
#include <algorithm>
#include <time.h>
using namespace std;
char txt[100000], *p[100000];
int m, n;
int cmp(const void *p, const void *q) {
int rc = memcmp(*(char **)p, *(char **)q, m);
return rc;
}
int main() {
std::cin >> txt;
int start_s = clock();
n = strlen(txt);
int k; int i;
int count = 1;
for (m = 1; m <= n; m++) {
for (k = 0; k+m <= n; k++)
p[k] = txt+k;
qsort(p, k, sizeof(p[0]), &cmp);
for (i = 0; i < k; i++) {
if (i != 0 && cmp(&p[i-1], &p[i]) == 0){
continue;
}
char cur_txt[100000];
memcpy(cur_txt, p[i],m);
cur_txt[m] = '\0';
std::cout << count << " " << cur_txt << std::endl;
count++;
}
}
cout << --count << endl;
int stop_s = clock();
float run_time = (stop_s - start_s) / double(CLOCKS_PER_SEC);
cout << endl << "distinct substrings \t\tExecution time = " << run_time << " seconds" << endl;
return 0;
}
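Both algorithms, however, are simply too slow for listing the substrings of extremely long strings. I tested them on a string of length over 47,000 and they took over 20 minutes to complete: the first algorithm took 1200 seconds and the second took 1360 seconds, and that was only counting the unique substrings without printing them to the terminal. So for strings of length up to perhaps 1000 you might get a workable solution. Both solutions did compute the same total number of unique substrings. I also tested both algorithms on strings of length 2000 and 10,000. The times for the first algorithm were 0.33 seconds and 12 seconds; for the second algorithm they were 0.535 seconds and 20 seconds. So it looks like the first algorithm is generally faster.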
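Answer 12 (score: 0)
Many answers that use 2 for loops and a .substring() call claim O(N^2) time complexity. It is important to note, however, that the worst case for a .substring() call in Java (after update 6 of Java 7) is O(N). So adding a .substring() call to your code raises the order of N by one.
Therefore, 2 for loops with a .substring() call inside them amount to O(N^3) time complexity.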
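To make the extra factor concrete, the following Python sketch adds up how many characters are copied when every substring is materialised by a copying slice, which is the situation the answer describes for Java's .substring(). The function name `chars_copied` is made up for this example. The sum has the closed form n(n + 1)(n + 2) / 6, which grows as O(n^3).

```python
def chars_copied(n):
    """Total characters copied when slicing out every substring of a length-n string."""
    # A substring from index i (inclusive) to j (exclusive) copies j - i characters.
    return sum(j - i for i in range(n) for j in range(i + 1, n + 1))


if __name__ == "__main__":
    for n in (4, 10, 100):
        # Compare the brute-force sum with the closed form n(n+1)(n+2)/6.
        print(n, chars_copied(n), n * (n + 1) * (n + 2) // 6)
```

For n = 4 this is 4*1 + 3*2 + 2*3 + 1*4 = 20 characters for only 10 substrings.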
Answer 13 (score: -2)
Your program does not give unique substrings.
Please test with the input abab; the output should be aba, ba, bab, abab.