Talk:Bitap algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Quuxplusone (talk | contribs) at 17:19, 8 November 2005 (response to Nroets). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Someone wrote 'it performs best on patterns less than a constant length'. Such statements should not be made without adequate analysis. It is true that patterns that are 33 characters long may take twice as long as patterns of length 32, but if the algorithm beats the hell out of all its competitors, or it takes 2 nanoseconds instead of 1, then there's no reason not to use it.

Then the paragraph says the complexity in this case is O(m+n). But if m is limited to 32, then O(m+n) is the same as O(n), because a constant term is O(1) and is absorbed by the O(n) term.

Perhaps we should say that for arbitrary m and n, the algorithm has complexity O(kmn). Now this may look inefficient, but if you consider that modern processors can perform in the region of 64 billion of these operations every second, you'll understand why the algorithm is so fast. (unsigned comment by User:Nroets, 8 Nov 2005)

First point: Bitap does perform best on patterns less than the word length of the machine, because then it can be optimized to use bitwise operations and shifts. If you want to use bitap for longer patterns, you'll need to "pessimize" it either by using the array-of-BIT approach given in the article's first snippet, or by doing awkward things with carried bits. This has nothing to do with comparing bitap to other search algorithms; it's simply stating that bitap itself performs better on small patterns than on long ones.
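
To make the first point concrete, here is a minimal sketch of the word-sized case, written for this discussion and not taken from the article. It is the shift-and formulation of exact-match bitap; the name `bitap_search` and the `word_bits` parameter are assumptions of the sketch, and the explicit mask only imitates a fixed-width machine word, since Python integers are unbounded.

```python
def bitap_search(text: str, pattern: str, word_bits: int = 32) -> int:
    """Return the index of the first exact match of pattern in text, or -1.

    Sketch only: word_bits is an assumed machine word size, not a real limit
    of Python integers.
    """
    m = len(pattern)
    if m == 0:
        return 0
    if m > word_bits:
        # This is where a real implementation would have to fall back to
        # an array of bits or to multi-word shifts with carried bits.
        raise ValueError("pattern longer than the assumed machine word")

    # Bit i of masks[c] is set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)

    word_mask = (1 << word_bits) - 1
    state = 0  # bit i set: pattern[:i + 1] matches the text ending here
    for j, c in enumerate(text):
        # One shift, one OR, and one AND per text character, whatever m is.
        state = ((state << 1) | 1) & masks.get(c, 0) & word_mask
        if state & (1 << (m - 1)):
            return j - m + 1
    return -1
```

For a pattern that fits in the word, each text character costs a fixed handful of bitwise operations, which is the sense in which bitap "performs best" on short patterns.
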
Second point: On rereading, I can't figure out why O(m+n), either. So I've changed it to O(mn): we iterate once for each of n text characters, and O(m) times over the bit array for each of those characters. There is an additional O(m) term for setting up the bit array, and in the article's optimized implementations there is an O(k) term where k is the size of the alphabet. So it's more like O(mn+m+k), but the O(mn) term is definitely the dominant one in non-pathological cases. --Quuxplusone 17:19, 8 November 2005 (UTC)
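
The O(mn) count can be checked with a small sketch of the array-of-bits variant. This is an illustration written for this discussion, not the article's snippet; `bitap_search_array` and the `steps` counter are hypothetical names added only to make the inner-loop work visible.

```python
def bitap_search_array(text: str, pattern: str):
    """Return (index of first exact match or -1, inner-loop steps executed)."""
    m = len(pattern)
    if m == 0:
        return 0, 0

    # R[i] is True when pattern[:i] matches the text ending at the current
    # position. R[0] is always True (the empty prefix always matches).
    R = [False] * (m + 1)  # O(m) setup
    R[0] = True

    steps = 0
    for j, c in enumerate(text):        # n iterations
        # Walk from high to low so R[i - 1] still holds the previous value.
        for i in range(m, 0, -1):       # m iterations per text character
            R[i] = R[i - 1] and pattern[i - 1] == c
            steps += 1
        if R[m]:
            return j - m + 1, steps
    return -1, steps


# No match: the inner loop runs m times for each of the n characters.
print(bitap_search_array("aaaaaaaa", "aab"))  # (-1, 24), i.e. 8 * 3 steps
```

When there is no match, `steps` comes out as exactly m times n, which is where the dominant O(mn) term comes from when the pattern does not fit in a machine word.
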