Channel: CodeMaestro » Articles

Easy tricks for optimizing common string operations



When implementing a data structure that contains strings, or a string-based key, its most basic operation is probably checking whether a query string exists. Whether the data structure uses hashes, trees or any other pattern, it all boils down to comparing your query string against strings from the repository. With each search we try to find our query string, or prove that it does not exist, with as few string comparisons as possible.

The cost of each individual string comparison is sometimes overlooked. This article lists some easy tricks that could make your string comparisons run much faster.

When approaching the string comparison optimization problem, what we would like is an effective and efficient way to rule out most of the candidate strings. We may refer to it as a “disqualifying comparison”: it lets us move faster down the search tree, or along the linked list of a hash bucket, until we reach the final string comparison of the search. Keep in mind that even the most efficient hash structure would probably spend a substantial share of its time and cycles on string comparison.

Note that this article assumes all strings in the world are composed of 1 byte characters, which is not true (unfortunately for our community, someone invented Unicode…). However, to demonstrate the principles, let’s assume they are.

A. A different length is sometimes enough

When the only need is to find out whether two strings are identical or not, checking that the string lengths differ is enough to rule out most of the compared candidates. Of course, that doesn’t mean calculating the lengths of strings for each comparison. In common practice, all that needs to be done is to store the string lengths along with the actual strings in the data structure, and to design the software so that the query also carries both the actual string and its length. Therefore, at runtime, there will be no need to calculate lengths for most comparisons.
Only when the string lengths are the same is there a need to actually compare the strings themselves.
Obviously, this optimization is effective only for data repositories that contain strings of varying lengths, but in those cases most candidates are rejected with a single integer comparison, and a full string comparison is needed only for the few candidates of matching length. When comparing user names, URLs, or other human readable resource names, this little optimization proves itself.
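
The length check described above can be sketched roughly as follows. This is only an illustration: the `lstring` struct and the `lstring_equal` function are hypothetical names invented for this example, not part of any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string: the length is computed once,
 * when the entry (or the query) is created, and stored with the bytes. */
typedef struct {
    size_t      len;
    const char *data;
} lstring;

/* Equality test only: returns true when both strings hold the same bytes.
 * Strings of different lengths are rejected without touching their data. */
bool lstring_equal(const lstring *a, const lstring *b)
{
    if (a->len != b->len)
        return false;                 /* cheap disqualifying comparison */
    return memcmp(a->data, b->data, a->len) == 0;
}
```

Since the lengths are already known here, the byte comparison can use memcmp rather than strcmp.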

B. Why compare byte by byte anyway?

It turns out that some (if not most) implementations of libc use a byte by byte comparison for strcmp and other string comparison functions. For example, a snippet from OpenBSD’s libc implementation (found with Google Code Search):

int
strcmp(const char *s1, const char *s2)
{
        while (*s1 == *s2++)
                if (*s1++ == 0)
                        return (0);
        return (*(unsigned char *)s1 - *(unsigned char *)--s2);
}

A much more efficient implementation would compare elements according to the processor’s word size, usually sizeof(size_t) bytes. A 64 bit architecture is able to compare 8 ASCII characters with a single instruction, so why not use it?
In many cases, a simple solution would be to use memcmp instead of strcmp, which is possible whenever the string lengths are known (as in section A). On many platforms, memcmp is implemented very efficiently: some use a word size comparison in C, as in this glibc implementation, and some even implement it very carefully in assembly, such as the Sparc64 Linux kernel implementation. When you know your memcmp is not as efficient, you can implement your own function, based on implementations such as those in the newest glibc you can find.
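
For illustration only, a word-at-a-time equality test might look like the sketch below. The function name `equal_by_words` is made up for this example, and the code assumes the length `n` of both buffers is already known. It is not taken from glibc or any other libc, and a tuned memcmp will usually do better.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Equality test over n bytes, one machine word at a time.
 * Copying into a local word with memcpy avoids unaligned or
 * aliasing-unsafe loads; compilers typically turn each fixed-size
 * memcpy into a single load instruction. */
bool equal_by_words(const char *a, const char *b, size_t n)
{
    size_t i = 0;

    for (; i + sizeof(size_t) <= n; i += sizeof(size_t)) {
        size_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return false;
    }
    for (; i < n; i++)                /* remaining tail, byte by byte */
        if (a[i] != b[i])
            return false;
    return true;
}
```

Note that this only answers "equal or not"; unlike strcmp, it does not report which string sorts first.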

C. Direction counts

In some cases, all strings have similar characteristics, specifically similar prefixes or suffixes. For example, a repository of phone numbers is likely to have many similar prefixes, whether the same country code for the USA or even the same area code for all the phone numbers in the same state. In this example, the length comparison optimization described above would not be very effective, since all the phone numbers in the same state have the same length.

However, an effective optimization here is to implement a reverse order string comparison function, which compares from the last character toward the first. This rules out most of the strings much faster than regular comparison methods.
Taking the same method further, the characteristics of the strings may suggest the optimal comparison function to use. As an example, let’s examine a repository of picture file names. Each file name is likely to begin with, let’s say, the prefix “pic” and end with the file extension, which is probably “.jpg”. Therefore, an effective disqualifying comparison would probably be a reverse order comparison starting four characters from the end.
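
A reverse order comparison along these lines could be sketched as follows. The function name `equal_reverse` and its `skip` parameter are hypothetical, and the sketch assumes both strings have already been found to have the same length `len`. The skipped trailing bytes are trusted rather than verified, so this is a disqualifying comparison only when every entry really shares that suffix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Equality test that walks both strings from the end toward the start.
 * `skip` is the number of trailing bytes assumed identical in every
 * entry (for example 4 for ".jpg"); pass 0 to compare everything. */
bool equal_reverse(const char *a, const char *b, size_t len, size_t skip)
{
    if (skip > len)
        skip = len;
    for (size_t i = len - skip; i > 0; i--)
        if (a[i - 1] != b[i - 1])
            return false;
    return true;
}
```

For phone numbers one would call it with `skip` set to 0; for the picture file names above, with `skip` set to 4.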

D. Boost your case insensitive comparisons

Some string based repositories are required to be case insensitive. Given a search string, a naive implementation would first transform the string to lower case letters and only then search for it. However, this means copying the query string. Given that most of the operations to be done are queries, there must be a way to avoid these string copy operations.
A nice solution found on one GameDev forum is using the magic constant 0xDF. It relies on the fact that upper case and lower case ASCII letters differ only in the 6th bit (0x20). Clearing that bit with a bitwise AND maps every lower case letter to its upper case counterpart, so each character comparison can be made case insensitive without copying the query. Note that the mask also alters some non-letter characters (digits, for example), so the trick is safe as written only for alphabetic strings. A single case-insensitive character comparison would look like this, assuming your repository entries are already in upper case:

(query[i] & 0xDF) == dbString[i]

When this method is combined with the optimization in section B above, a case insensitive comparison of 4 characters at a time on a 32 bit architecture uses a bitwise AND with 0xDFDFDFDF:

(query32bitValue[i] & 0xDFDFDFDF) == dbString32bitValue[i]
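
Putting the two masks together, a complete comparison might be sketched as below. The function name `equal_nocase_upper` is invented for this example. The sketch assumes ASCII keys made of letters only, repository entries stored in upper case, and a length `n` that is already known. In C, `==` binds more tightly than `&`, so the masked value has to be parenthesized.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumes letters-only ASCII keys, with db_upper stored in upper case.
 * Masking with 0xDF clears bit 5 (0x20) of every query byte, which
 * maps 'a'..'z' onto 'A'..'Z' without copying the query string. */
bool equal_nocase_upper(const char *query, const char *db_upper, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {              /* four bytes per step */
        uint32_t q, d;
        memcpy(&q, query + i, sizeof q);
        memcpy(&d, db_upper + i, sizeof d);
        if ((q & UINT32_C(0xDFDFDFDF)) != d)
            return false;
    }
    for (; i < n; i++)                        /* tail, one byte at a time */
        if (((unsigned char)query[i] & 0xDF) != (unsigned char)db_upper[i])
            return false;
    return true;
}
```

Because the mask is applied to each byte independently, the result does not depend on the byte order of the machine.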

The optimal string comparison function

Perhaps the most important conclusion here is that there is no such thing as an optimal string comparison function to copy-paste from this article. When implementing a string repository, take the time to find the comparison function that suits your specific needs and the characteristics of your data.

