
Talk:Bitap algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Quuxplusone (talk | contribs) at 06:19, 4 January 2009 (Comment on code sample: code is OK).

Someone wrote 'it performs best on patterns less than a constant length'. Such statements should not be made without adequate analysis. It is true that patterns 33 characters long may take twice as long as patterns of length 32, but if the algorithm beats the hell out of all its competitors, or takes 2 nanoseconds instead of 1, then there's no reason not to use it.

Then the paragraph says the complexity in this case is O(m+n). But if m is limited to 32, then O(m+n) is the same as O(n), because a constant term simply disappears inside the big-O.

Perhaps we should say that for arbitrary m and n, the algorithm has complexity O(kmn). This may look inefficient, but when you consider that modern processors can perform on the order of 64 billion of these operations per second, you'll understand why the algorithm is so fast. (unsigned comment by User:Nroets, 8 Nov 2005)

First point: Bitap does perform best on patterns less than the word length of the machine, because then it can be optimized to use bitwise operations and shifts. If you want to use bitap for longer patterns, you'll need to "pessimize" it either by using the array-of-BIT approach given in the article's first snippet, or by doing awkward things with carried bits. This has nothing to do with comparing bitap to other search algorithms; it's simply stating that bitap itself performs better on small patterns than on long ones.
Second point: On rereading, I can't figure out why O(m+n), either. So I've changed it to O(mn) — we iterate once for each of n text characters, and O(m) times over the bit array for each of those characters. There is an additional O(m) term for setting up the bit array, and in the article's optimized implementations there is an O(k) term where k is the size of the alphabet. So it's more like O(mn+n+k), but the O(mn) term is definitely the dominant one in non-pathological cases. --Quuxplusone 17:19, 8 November 2005 (UTC)[reply]
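For what it's worth, the "awkward things with carried bits" for patterns longer than one machine word might look something like the sketch below. This is only an illustration of the carry bookkeeping, not code from the article; the WORDS constant and the shift_left_1 name are inventions of mine.

```c
#include <limits.h>

#define WORDS 4  /* supports patterns up to 4*word-size-1 chars; illustrative only */

/* Shift a multi-word bit array left by one bit, carrying the top bit of
   each word into the bottom bit of the next. Word 0 holds the low-order
   bits, mirroring how the single-word R is used in the article's snippet. */
static void shift_left_1(unsigned long R[WORDS])
{
    const int BITS = sizeof(unsigned long) * CHAR_BIT;
    unsigned long carry = 0;
    int w;
    for (w = 0; w < WORDS; ++w) {
        unsigned long next_carry = R[w] >> (BITS - 1);
        R[w] = (R[w] << 1) | carry;
        carry = next_carry;
    }
}
```

The per-character update then costs a loop over O(m/w) words instead of a single shift instruction, which is exactly where the small-pattern speed advantage comes from.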

Comment on code sample

User:134.2.247.43 added a comment to the code in the article pointing out a potential error. I'm not judging at the moment whether they're correct, but it should be discussed here on the talk page to avoid self-contradiction in the article. Dcoetzee 20:25, 20 October 2008 (UTC)[reply]

#include <string.h>
#include <limits.h>

const char *bitap_bitwise_search(const char *text, const char *pattern)
{
    int m = strlen(pattern);
    unsigned long R;
    unsigned long pattern_mask[CHAR_MAX+1];
    int i;

    if (pattern[0] == '\0') return text;
    if (m > 31) return "The pattern is too long!";
 
    /* Initialize the bit array R */
    R = ~1;  /* I think this is wrong, because then we have no match if text starts with pattern. It should be (~1)<<1. */

    /* Initialize the pattern bitmasks */
    for (i=0; i <= CHAR_MAX; ++i)
      pattern_mask[i] = ~0;
    for (i=0; i < m; ++i)
      pattern_mask[pattern[i]] &= ~(1UL << i);

    for (i=0; text[i] != '\0'; ++i) {
        /* Update the bit array */
        R |= pattern_mask[text[i]];
        R <<= 1;

        if (0 == (R & (1UL << m)))
          return (text+i - m) + 1;
    }

    return NULL;
}

Tracing the code (or thirty seconds with a test case and a C compiler) shows that User:134.2.247.43's concern is unfounded. --Quuxplusone (talk) 06:19, 4 January 2009 (UTC)[reply]