LeetCode practice diary: 1044. Longest Duplicate Substring

Posted by vurentjie on Mon, 27 Dec 2021 08:00:47 +0100

  • Problem description:
    Given a string s, consider all of its duplicated substrings: contiguous substrings of s that occur 2 or more times. The occurrences may overlap.
    Return any duplicated substring that has the longest possible length. If s has no duplicated substring, the answer is "".

  • Examples:
    Input: s = "banana"
    Output: "ana"
    Input: s = "abcd"
    Output: ""

  • Constraints:
    2 <= s.length <= 3 * 10^4
    s consists of lowercase English letters

  • Analysis: this problem is rated Hard, and it really is hard. It tests several techniques, none of which is difficult on its own; the hard part is thinking of them and combining them. Hard problems like this one are usually about optimization. A brute-force solution will almost certainly time out, but it is still the starting point for working out what to optimize. So if we cannot see the optimized solution at a glance, we start from brute force.

  • Step 1: how do we brute-force it? The problem gives us a string s and asks for the longest contiguous substring that occurs more than once. The longest substring of s is s itself, but it occurs only once, so it cannot be the answer. At the other extreme there may be no repeated substring at all. This is easy to detect, because it happens exactly when no character repeats, i.e. len(set(s)) == len(s). We set that case aside for now. If a repeated substring exists, its length lies in the range [1, n-1], where n is the length of s. To reach a brute-force approach we simplify the problem: given a fixed length k, decide whether some substring of length k is repeated and, if so, return the first such substring in s. A sliding window is the natural tool: fix a window of length k, slide it across s, and compare it with every later window. One comparison costs k and there are about (n-k)*(n-k) pairs of windows, so the cost for one length is k*(n-k)*(n-k). Treating the comparison cost k as a constant factor, that is O(n^2) per length. Trying every window length in [1, n-1] adds another factor of O(n), so the total time complexity is at least O(n^3).
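The brute-force idea from Step 1 can be sketched as follows. This is only an illustration of the approach described above, not code from the original post; the function names has_repeat_of_length and longest_dup_brute_force are made up for this example. It is far too slow for the real constraints.

```python
def has_repeat_of_length(s: str, k: int) -> int:
    """Return the start index of the first length-k substring that
    occurs again later in s, or -1 if there is none."""
    n = len(s)
    for i in range(n - k + 1):                # left end of the window
        window = s[i:i + k]
        for j in range(i + 1, n - k + 1):     # every later window (overlap allowed)
            if s[j:j + k] == window:          # O(k) string comparison
                return i
    return -1


def longest_dup_brute_force(s: str) -> str:
    # Try lengths from longest to shortest, so the first hit is a longest duplicate.
    for k in range(len(s) - 1, 0, -1):
        i = has_repeat_of_length(s, k)
        if i != -1:
            return s[i:i + k]
    return ""


print(longest_dup_brute_force("banana"))  # ana
print(repr(longest_dup_brute_force("abcd")))  # ''
```

The three nested levels (length, window, later window) are where the O(n^3) estimate comes from.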

  • Step 2: how to optimize. First, can we avoid trying every window length? Yes. The standard way to find a value in an ordered range is binary search. The idea is this: if a repeated substring of length n/2 exists, then the longest repeated substring has length at least n/2; otherwise it is shorter than n/2. This works because every prefix of a repeated substring is itself repeated, so as k grows the answer to "is there a repeated substring of length k?" changes from yes to no only once. Binary search reduces the outer loop over lengths from O(n) to O(log n).
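A minimal sketch of the binary search over the length, assuming a helper find_repeat (a name invented here) that answers "is there a repeated substring of length k?". For brevity the helper stores slices in a set instead of comparing windows pairwise; building and hashing each slice still costs O(k), so this is not yet the final optimization.

```python
def find_repeat(s: str, k: int) -> int:
    """Return the start index of a length-k substring that already
    appeared earlier in s, or -1 if every length-k substring is unique."""
    seen = set()
    for i in range(len(s) - k + 1):
        window = s[i:i + k]
        if window in seen:
            return i
        seen.add(window)
    return -1


def longest_dup_binary_search(s: str) -> str:
    lo, hi = 1, len(s) - 1            # a duplicate has length in [1, n-1]
    best_start, best_len = -1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        start = find_repeat(s, mid)
        if start != -1:               # length mid works: try something longer
            best_start, best_len = start, mid
            lo = mid + 1
        else:                         # length mid fails: every longer length fails too
            hi = mid - 1
    return s[best_start:best_start + best_len] if best_start != -1 else ""


print(longest_dup_binary_search("banana"))  # ana
```

Only O(log n) lengths are tested, instead of all n-1 of them.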

  • Step 3: after the second step the total time complexity is O(n^2 log n), which still does not pass, so more optimization is needed. The only part left to optimize is the sliding-window check, whose O(n^2) cost has to come down. This is where the Rabin-Karp algorithm comes in. If it did not come to mind, that is fine; we will know it next time. Rabin-Karp is a string-matching algorithm based on hashing. With it, the check for one length takes O(n), so the final time complexity is O(n log n). Pass! The principle of Rabin-Karp is simple: hash every substring. Since this problem uses only lowercase letters, we can map each letter to a number through its ASCII code, which turns the string into an array of integers. We then look at every window of length m (the sliding-window length) in this array. There are 26 lowercase letters, so each window of length m can be read as an m-digit number in base 26, and that number is its hash. Moving the window one step to the right only drops the leading digit and appends a new last digit, so each hash follows from the previous one in constant time; in the code the value is also reduced modulo a large number to keep it bounded. If two substrings are equal, their hashes must be equal. However, because of hash collisions, two different substrings may also have equal hashes. There are two remedies. The first is to use several bases at once, such as base 27 and base 29, which greatly reduces the probability of a collision, although a collision is still possible. The second is to compare the two substrings directly whenever their hashes are equal. This raises the time complexity by a factor of m and does not pass for this problem, but wrong answers caused by collisions are ruled out completely.
      Two reference links are attached here. I don't find them very clear, so in the next post I will cover the principle and implementation of the Rabin-Karp algorithm.

  • Rabin Karp algorithm: string matching problem

  • https://zhuanlan.zhihu.com/p/93429400
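To make the rolling hash from Step 3 concrete, here is a small sketch that computes the hash of every window of length m and checks it against a hash computed from scratch. The names window_hashes and direct_hash, the base 26 and the modulus 10**9 + 7 are choices made for this illustration, and the sketch assumes 1 <= m <= len(s).

```python
def window_hashes(s: str, m: int, base: int = 26, mod: int = 10**9 + 7) -> list:
    """Hash of every length-m window of s; each hash after the first is
    derived from the previous one in O(1)."""
    arr = [ord(c) - ord('a') for c in s]
    top = pow(base, m - 1, mod)              # weight of the leading character
    h = 0
    for x in arr[:m]:                        # hash the first window from scratch
        h = (h * base + x) % mod
    hashes = [h]
    for i in range(m, len(arr)):
        h = (h - arr[i - m] * top) % mod     # drop the character leaving the window
        h = (h * base + arr[i]) % mod        # shift and append the new character
        hashes.append(h)
    return hashes


def direct_hash(t: str, base: int = 26, mod: int = 10**9 + 7) -> int:
    """Reference hash computed from scratch in O(len(t))."""
    h = 0
    for c in t:
        h = (h * base + ord(c) - ord('a')) % mod
    return h


hashes = window_hashes("banana", 3)
print(hashes)  # [689, 338, 8801, 338]
assert hashes == [direct_hash("banana"[i:i + 3]) for i in range(4)]
```

The two equal values 338 belong to the two occurrences of "ana", which is exactly the signal the check function looks for.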

  • The solution that reduces collisions by using two hashes (the official approach):

import random


class Solution:
    def longestDupSubstring(self, s: str) -> str:
        """
        Given a string s, consider all of its duplicated substrings: contiguous
        substrings of s that occur 2 or more times. The occurrences may overlap.
        Return any duplicated substring of maximum length, or "" if there is none.

        >>> Solution().longestDupSubstring("banana")
        'ana'
        """
        # Pick two random bases
        a1, a2 = random.randint(26, 100), random.randint(26, 100)
        # Pick two random moduli
        mod1, mod2 = random.randint(10**9+7, 2**31-1), random.randint(10**9+7, 2**31-1)
        # Map each letter to an integer in [0, 25]
        arr = [ord(c)-ord('a') for c in s]

        def check(arr, m, a1, a2, mod1, mod2):
            # Return the start of a length-m window whose hash pair was seen before, or -1
            n = len(arr)
            # a^m mod p: weight of the character that leaves the window after the shift
            aL1, aL2 = pow(a1, m, mod1), pow(a2, m, mod2)
            # Hash the first window
            hash1, hash2 = 0, 0
            for i in range(m):
                hash1 = (hash1 * a1 + arr[i]) % mod1
                hash2 = (hash2 * a2 + arr[i]) % mod2
            seen = {(hash1, hash2)}
            for start in range(1, n - m + 1):
                # Slide right: shift one digit, drop arr[start - 1], append arr[start + m - 1]
                hash1 = (hash1 * a1 - arr[start - 1] * aL1 + arr[start + m - 1]) % mod1
                hash2 = (hash2 * a2 - arr[start - 1] * aL2 + arr[start + m - 1]) % mod2
                if (hash1, hash2) in seen:
                    return start
                seen.add((hash1, hash2))
            return -1

        # Binary search on the length of the duplicated substring
        left, right = 1, len(s)-1
        ans, maxLength = -1, 0
        while left <= right:
            mid = left + (right - left + 1) // 2
            repeatIndex = check(arr, mid, a1, a2, mod1, mod2)
            if repeatIndex != -1:
                # Length mid works, so try a longer one
                left = mid+1
                ans = repeatIndex
                maxLength = mid
            else:
                right = mid - 1
        return s[ans:ans+maxLength] if ans != -1 else ""

  • The solution that avoids hash collisions entirely (it does not pass on LeetCode):
from collections import defaultdict


def longestDupSubstring(s):
    """
    Given a string s, consider all of its duplicated substrings: contiguous
    substrings of s that occur 2 or more times. The occurrences may overlap.
    Return any duplicated substring of maximum length, or "" if there is none.

    >>> longestDupSubstring("nnpxouomcofdjuujloanjimymadkuepightrfodmauhrsy")
    'ma'
    """
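    # Map each letter to an integer in [0, 25]
    arr = [ord(c)-ord('a') for c in s]
    modulo = 2**64-1

    def check(arr, m):
        # Maps a hash value to the start indices of the windows that produced it
        visited = defaultdict(list)
        # Hash the first window as an m-digit number in base 26
        hashM = 0
        for i in range(m):
            hashM = hashM*26 + arr[i]
        visited[hashM%modulo].append(0)
        for j in range(m, len(arr)):
            # Drop the leading digit, then shift and append arr[j].
            # Note: 26**(m-1) is recomputed as a big integer on every step, which is costly.
            hashM = hashM - arr[j-m]*(26**(m-1))
            hashM = (hashM*26 + arr[j])%modulo
            if hashM in visited:
                # Equal hashes are only a hint: confirm by comparing the substrings
                for v in visited[hashM]:
                    if s[j-m+1:j+1] == s[v:v+m]:
                        return v
            visited[hashM].append(j-m+1)
        return -1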
    
    # Binary search on the length; a duplicated substring has length in [1, n-1]
    left, right = 1, len(s)-1
    ans, maxLength = -1, 0
    while left <= right:
        mid = (left + right) // 2
        repeatIndex = check(arr, mid)
        if repeatIndex != -1:
            left = mid+1
            ans = repeatIndex
            maxLength = mid
        else:
            right = mid - 1
    return s[ans:ans+maxLength] if ans != -1 else ""


s = "nnpxouomcofdjuujloanjimymadkuepightrfodmauhrsy"
print(longestDupSubstring(s))
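To see why the direct comparison matters, the following sketch repeats the verified check with a deliberately tiny modulus (7) so that collisions happen even on short inputs. The function name find_repeat_verified, the false-alarm counter and the modulus are assumptions made for this demonstration only; real code would use a large modulus as above.

```python
from collections import defaultdict


def find_repeat_verified(s: str, m: int, base: int = 26, mod: int = 7):
    """Return (start, false_alarms): the start of the earlier occurrence of a
    repeated length-m substring (or -1), and how many hash matches turned out
    to be collisions between different substrings."""
    arr = [ord(c) - ord('a') for c in s]
    top = pow(base, m - 1, mod)               # weight of the leading character
    buckets = defaultdict(list)               # hash -> start indices seen so far
    false_alarms = 0
    h = 0
    for i in range(len(arr) - m + 1):
        if i == 0:
            for x in arr[:m]:                 # first window, hashed from scratch
                h = (h * base + x) % mod
        else:                                 # rolling update for later windows
            h = ((h - arr[i - 1] * top) * base + arr[i + m - 1]) % mod
        for v in buckets[h]:
            if s[v:v + m] == s[i:i + m]:      # confirmed: a real repeat
                return v, false_alarms
            false_alarms += 1                 # same hash, different text
        buckets[h].append(i)
    return -1, false_alarms


print(find_repeat_verified("ah", 1))      # (-1, 1): 'a' and 'h' collide modulo 7
print(find_repeat_verified("banana", 3))  # (1, 1): "nan" collides with "ana" first
```

Without the comparison, the input "ah" would be reported as containing a repeated character. With it, the collision is counted and discarded, at the price of an O(m) comparison per hash match.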

Topics: leetcode Binary Search