Better String library --------------------- by Paul Hsieh The bstring library is an attempt to provide improved string processing functionality to the C and C++ language. At the heart of the bstring library (Bstrlib for short) is the management of "bstring"s which are a significant improvement over '\0' terminated char buffers. =============================================================================== Motivation ---------- The standard C string library has serious problems: 1) Its use of '\0' to denote the end of the string means knowing a string's length is O(n) when it could be O(1). 2) It imposes an interpretation for the character value '\0'. 3) gets() always exposes the application to a buffer overflow. 4) strtok() modifies the string its parsing and thus may not be usable in programs which are re-entrant or multithreaded. 5) fgets has the unusual semantic of ignoring '\0's that occur before '\n's are consumed. 6) There is no memory management, and actions performed such as strcpy, strcat and sprintf are common places for buffer overflows. 7) strncpy() doesn't '\0' terminate the destination in some cases. 8) Passing NULL to C library string functions causes an undefined NULL pointer access. 9) Parameter aliasing (overlapping, or self-referencing parameters) within most C library functions has undefined behavior. 10) Many C library string function calls take integer parameters with restricted legal ranges. Parameters passed outside these ranges are not typically detected and cause undefined behavior. So the desire is to create an alternative string library that does not suffer from the above problems and adds in the following functionality: 1) Incorporate string functionality seen from other languages. a) MID$() - from BASIC b) split()/join() - from Python c) string/char x n - from Perl 2) Implement analogs to functions that combine stream IO and char buffers without creating a dependency on stream IO functionality. 3) Implement the basic text editor-style functions insert, delete, find, and replace. 4) Implement reference based sub-string access (as a generalization of pointer arithmetic.) 5) Implement runtime write protection for strings. There is also a desire to avoid "API-bloat". So functionality that can be implemented trivially in other functionality is omitted. So there is no left$() or right$() or reverse() or anything like that as part of the core functionality. Explaining Bstrings ------------------- A bstring is basically a header which wraps a pointer to a char buffer. Lets start with the declaration of a struct tagbstring: struct tagbstring { int mlen; int slen; unsigned char * data; }; This definition is considered exposed, not opaque (though it is neither necessary nor recommended that low level maintenance of bstrings be performed whenever the abstract interfaces are sufficient). The mlen field (usually) describes a lower bound for the memory allocated for the data field. The slen field describes the exact length for the bstring. The data field is a single contiguous buffer of unsigned chars. Note that the existence of a '\0' character in the unsigned char buffer pointed to by the data field does not necessarily denote the end of the bstring. To be a well formed modifiable bstring the mlen field must be at least the length of the slen field, and slen must be non-negative. Furthermore, the data field must point to a valid buffer in which access to the first mlen characters has been acquired. So the minimal check for correctness is: (slen >= 0 && mlen >= slen && data != NULL) bstrings returned by bstring functions can be assumed to be either NULL or satisfy the above property. (When bstrings are only readable, the mlen >= slen restriction is not required; this is discussed later in this section.) A bstring itself is just a pointer to a struct tagbstring: typedef struct tagbstring * bstring; Note that use of the prefix "tag" in struct tagbstring is required to work around the inconsistency between C and C++'s struct namespace usage. This definition is also considered exposed. Bstrlib basically manages bstrings allocated as a header and an associated data-buffer. Since the implementation is exposed, they can also be constructed manually. Functions which mutate bstrings assume that the header and data buffer have been malloced; the bstring library may perform free() or realloc() on both the header and data buffer of any bstring parameter. Functions which return bstring's create new bstrings. The string memory is freed by a bdestroy() call (or using the bstrFree macro). The following related typedef is also provided: typedef const struct tagbstring * const_bstring; which is also considered exposed. These are directly bstring compatible (no casting required) but are just used for parameters which are meant to be non-mutable. So in general, bstring parameters which are read as input but not meant to be modified will be declared as const_bstring, and bstring parameters which may be modified will be declared as bstring. This convention is recommended for user written functions as well. Since bstrings maintain interoperability with C library char-buffer style strings, all functions which modify, update or create bstrings also append a '\0' character into the position slen + 1. This trailing '\0' character is not required for bstrings input to the bstring functions; this is provided solely as a convenience for interoperability with standard C char-buffer functionality. Analogs for the ANSI C string library functions have been created when they are necessary, but have also been left out when they are not. In particular there are no functions analogous to fwrite, or puts just for the purposes of bstring. The ->data member of any string is exposed, and therefore can be used just as easily as char buffers for C functions which read strings. For those that wish to hand construct bstrings, the following should be kept in mind: 1) While bstrlib can accept constructed bstrings without terminating '\0' characters, the rest of the C language string library will not function properly on such non-terminated strings. This is obvious but must be kept in mind. 2) If it is intended that a constructed bstring be written to by the bstring library functions then the data portion should be allocated by the malloc function and the slen and mlen fields should be entered properly. The struct tagbstring header is not reallocated, and only freed by bdestroy. 3) Writing arbitrary '\0' characters at various places in the string will not modify its length as perceived by the bstring library functions. In fact, '\0' is a legitimate non-terminating character for a bstring to contain. 4) For read only parameters, bstring functions do not check the mlen. I.e., the minimal correctness requirements are reduced to: (slen >= 0 && data != NULL) Better pointer arithmetic ------------------------- One built-in feature of '\0' terminated char * strings, is that its very easy and fast to obtain a reference to the tail of any string using pointer arithmetic. Bstrlib does one better by providing a way to get a reference to any substring of a bstring (or any other length delimited block of memory.) So rather than just having pointer arithmetic, with bstrlib one essentially has segment arithmetic. This is achieved using the macro blk2tbstr() which builds a reference to a block of memory and the macro bmid2tbstr() which builds a reference to a segment of a bstring. Bstrlib also includes functions for direct consumption of memory blocks into bstrings, namely bcatblk () and blk2bstr (). One scenario where this can be extremely useful is when string contains many substrings which one would like to pass as read-only reference parameters to some string consuming function without the need to allocate entire new containers for the string data. More concretely, imagine parsing a command line string whose parameters are space delimited. This can only be done for tails of the string with '\0' terminated char * strings. Improved NULL semantics and error handling ------------------------------------------ Unless otherwise noted, if a NULL pointer is passed as a bstring or any other detectably illegal parameter, the called function will return with an error indicator (either NULL or BSTR_ERR) rather than simply performing a NULL pointer access, or having undefined behavior. To illustrate the value of this, consider the following example: strcpy (p = malloc (13 * sizeof (char)), "Hello,"); strcat (p, " World"); This is not correct because malloc may return NULL (due to an out of memory condition), and the behaviour of strcpy is undefined if either of its parameters are NULL. However: bstrcat (p = bfromcstr ("Hello,"), q = bfromcstr (" World")); bdestroy (q); is well defined, because if either p or q are assigned NULL (indicating a failure to allocate memory) both bstrcat and bdestroy will recognize it and perform no detrimental action. Note that it is not necessary to check any of the members of a returned bstring for internal correctness (in particular the data member does not need to be checked against NULL when the header is non-NULL), since this is assured by the bstring library itself. bStreams -------- In addition to the bgets and bread functions, bstrlib can abstract streams with a high performance read only stream called a bStream. In general, the idea is to open a core stream (with something like fopen) then pass its handle as well as a bNread function pointer (like fread) to the bsopen function which will return a handle to an open bStream. Then the functions bsread, bsreadln or bsreadlns can be called to read portions of the stream. Finally, the bsclose function is called to close the bStream -- it will return a handle to the original (core) stream. So bStreams, essentially, wrap other streams. The bStreams have two main advantages over the bgets and bread (as well as fgets/ungetc) paradigms: 1) Improved functionality via the bunread function which allows a stream to unread characters, giving the bStream stack-like functionality if so desired. 2) A very high performance bsreadln function. The C library function fgets() (and the bgets function) can typically be written as a loop on top of fgetc(), thus paying all of the overhead costs of calling fgetc on a per character basis. bsreadln will read blocks at a time, thus amortizing the overhead of fread calls over many characters at once. However, clearly bStreams are suboptimal or unusable for certain kinds of streams (stdin) or certain usage patterns (a few spotty, or non-sequential reads from a slow stream.) For those situations, using bgets will be more appropriate. The semantics of bStreams allows practical construction of layerable data streams. What this means is that by writing a bNread compatible function on top of a bStream, one can construct a new bStream on top of it. This can be useful for writing multi-pass parsers that don't actually read the entire input more than once and don't require the use of intermediate storage. Aliasing -------- Aliasing occurs when a function is given two parameters which point to data structures which overlap in the memory they occupy. While this does not disturb read only functions, for many libraries this can make functions that write to these memory locations malfunction. This is a common problem of the C standard library and especially the string functions in the C standard library. The C standard string library is entirely char by char oriented (as is bstring) which makes conforming implementations alias safe for some scenarios. However no actual detection of aliasing is typically performed, so it is easy to find cases where the aliasing will cause anomolous or undesirable behaviour (consider: strcat (p, p).) The C99 standard includes the "restrict" pointer modifier which allows the compiler to document and assume a no-alias condition on usage. However, only the most trivial cases can be caught (if at all) by the compiler at compile time, and thus there is no actual enforcement of non-aliasing. Bstrlib, by contrast, permits aliasing and is completely aliasing safe, in the C99 sense of aliasing. That is to say, under the assumption that pointers of incompatible types from distinct objects can never alias, bstrlib is completely aliasing safe. (In practice this means that the data buffer portion of any bstring and header of any bstring are assumed to never alias.) With the exception of the reference building macros, the library behaves as if all read-only parameters are first copied and replaced by temporary non-aliased parameters before any writing to any output bstring is performed (though actual copying is extremely rarely ever done.) Besides being a useful safety feature, bstring searching/comparison functions can improve to O(1) execution when aliasing is detected. Note that aliasing detection and handling code in Bstrlib is generally extremely cheap. There is almost never any appreciable performance penalty for using aliased parameters. Reenterancy ----------- Nearly every function in Bstrlib is a leaf function, and is completely reenterable with the exception of writing to common bstrings. The split functions which use a callback mechanism requires only that the source string not be destroyed by the callback function unless the callback function returns with an error status (note that Bstrlib functions which return an error do not modify the string in any way.) The string can in fact be modified by the callback and the behaviour is deterministic. See the documentation of the various split functions for more details. Undefined scenarios ------------------- One of the basic important premises for Bstrlib is to not to increase the propogation of undefined situations from parameters that are otherwise legal in of themselves. In particular, except for extremely marginal cases, usages of bstrings that use the bstring library functions alone cannot lead to any undefined action. But due to C/C++ language and library limitations, there is no way to define a non-trivial library that is completely without undefined operations. All such possible undefined operations are described below: 1) bstrings or struct tagbstrings that are not explicitely initialized cannot be passed as a parameter to any bstring function. 2) The members of the NULL bstring cannot be accessed directly. (Though all APIs and macros detect the NULL bstring.) 3) A bstring whose data member has not been obtained from a malloc or compatible call and which is write accessible passed as a writable parameter will lead to undefined results. (i.e., do not writeAllow any constructed bstrings unless the data portion has been obtained from the heap.) 4) If the headers of two strings alias but are not identical (which can only happen via a defective manual construction), then passing them to a bstring function in which one is writable is not defined. 5) If the mlen member is larger than the actual accessible length of the data member for a writable bstring, or if the slen member is larger than the readable length of the data member for a readable bstring, then the corresponding bstring operations are undefined. 6) Any bstring definition whose header or accessible data portion has been assigned to inaccessible or otherwise illegal memory clearly cannot be acted upon by the bstring library in any way. 7) Destroying the source of an incremental split from within the callback and not returning with a negative value (indicating that it should abort) will lead to undefined behaviour. (Though *modifying* or adjusting the state of the source data, even if those modification fail within the bstrlib API, has well defined behavior.) 8) Modifying a bstring which is write protected by direct access has undefined behavior. While this may seem like a long list, with the exception of invalid uses of the writeAllow macro, and source destruction during an iterative split without an accompanying abort, no usage of the bstring API alone can cause any undefined scenario to occurr. I.e., the policy of restricting usage of bstrings to the bstring API can significantly reduce the risk of runtime errors (in practice it should eliminate them) related to string manipulation due to undefined action. C++ wrapper ----------- A C++ wrapper has been created to enable bstring functionality for C++ in the most natural (for C++ programers) way possible. The mandate for the C++ wrapper is different from the base C bstring library. Since the C++ language has far more abstracting capabilities, the CBString structure is considered fully abstracted -- i.e., hand generated CBStrings are not supported (though conversion from a struct tagbstring is allowed) and all detectable errors are manifest as thrown exceptions. - The C++ class definitions are all under the namespace Bstrlib. bstrwrap.h enables this namespace (with a using namespace Bstrlib; directive at the end) unless the macro BSTRLIB_DONT_ASSUME_NAMESPACE has been defined before it is included. - Erroneous accesses results in an exception being thrown. The exception parameter is of type "struct CBStringException" which is derived from std::exception if STL is used. A verbose description of the error message can be obtained from the what() method. - CBString is a C++ structure derived from a struct tagbstring. An address of a CBString cast to a bstring must not be passed to bdestroy. The bstring C API has been made C++ safe and can be used directly in a C++ project. - It includes constructors which can take a char, '\0' terminated char buffer, tagbstring, (char, repeat-value), a length delimited buffer or a CBStringList to initialize it. - Concatenation is performed with the + and += operators. Comparisons are done with the ==, !=, <, >, <= and >= operators. Note that == and != use the biseq call, while <, >, <= and >= use bstrcmp. - CBString's can be directly cast to const character buffers. - CBString's can be directly cast to double, float, int or unsigned int so long as the CBString are decimal representations of those types (otherwise an exception will be thrown). Converting the other way should be done with the format(a) method(s). - CBString contains the length, character and [] accessor methods. The character and [] accessors are aliases of each other. If the bounds for the string are exceeded, an exception is thrown. To avoid the overhead for this check, first cast the CBString to a (const char *) and use [] to dereference the array as normal. Note that the character and [] accessor methods allows both reading and writing of individual characters. - The methods: format, formata, find, reversefind, findcaseless, reversefindcaseless, midstr, insert, insertchrs, replace, findreplace, findreplacecaseless, remove, findchr, nfindchr, alloc, toupper, tolower, gets, read are analogous to the functions that can be found in the C API. - The caselessEqual and caselessCmp methods are analogous to biseqcaseless and bstricmp functions respectively. - Note that just like the bformat function, the format and formata methods do not automatically cast CBStrings into char * strings for "%s"-type substitutions: CBString w("world"); CBString h("Hello"); CBString hw; /* The casts are necessary */ hw.format ("%s, %s", (const char *)h, (const char *)w); - The methods trunc and repeat have been added instead of using pattern. - ltrim, rtrim and trim methods have been added. These remove characters from a given character string set (defaulting to the whitespace characters) from either the left, right or both ends of the CBString, respectively. - The method setsubstr is also analogous in functionality to bsetstr, except that it cannot be passed NULL. Instead the method fill and the fill-style constructor have been supplied to enable this functionality. - The writeprotect(), writeallow() and iswriteprotected() methods are analogous to the bwriteprotect(), bwriteallow() and biswriteprotected() macros in the C API. Write protection semantics in CBString are stronger than with the C API in that indexed character assignment is checked for write protection. However, unlike with the C API, a write protected CBString can be destroyed by the destructor. - CBStream is a C++ structure which wraps a struct bStream (its not derived from it, since destruction is slightly different). It is constructed by passing in a bNread function pointer and a stream parameter cast to void *. This structure includes methods for detecting eof, setting the buffer length, reading the whole stream or reading entries line by line or block by block, an unread function, and a peek function. - If STL is available, the CBStringList structure is derived from a vector of CBString with various split methods. The split method has been overloaded to accept either a character or CBString as the second parameter (when the split parameter is a CBString any character in that CBString is used as a seperator). The splitstr method takes a CBString as a substring seperator. Joins can be performed via a CBString constructor which takes a CBStringList as a parameter, or just using the CBString::join() method. - If there is proper support for std::iostreams, then the >> and << operators and the getline() function have been added (with semantics the same as those for std::string). Multithreading -------------- A mutable bstring is kind of analogous to a small (two entry) linked list allocated by malloc, with all aliasing completely under programmer control. I.e., manipulation of one bstring will never affect any other distinct bstring unless explicitely constructed to do so by the programmer via hand construction or via building a reference. Bstrlib also does not use any static or global storage, so there are no hidden unremovable race conditions. Bstrings are also clearly not inherently thread local. So just like char *'s, bstrings can be passed around from thread to thread and shared and so on, so long as modifications to a bstring correspond to some kind of exclusive access lock as should be expected (or if the bstring is read-only, which can be enforced by bstring write protection) for any sort of shared object in a multithreaded environment. Bsafe module ------------ For convenience, a bsafe module has been included. The idea is that if this module is included, inadvertant usage of the most dangerous C functions will be overridden and lead to an immediate run time abort. Of course, it should be emphasized that usage of this module is completely optional. The intention is essentially to provide an option for creating project safety rules which can be enforced mechanically rather than socially. This is useful for larger, or open development projects where its more difficult to enforce social rules or "coding conventions". Problems not solved ------------------- Bstrlib is written for the C and C++ languages, which have inherent weaknesses that cannot be easily solved: 1. Memory leaks: Forgetting to call bdestroy on a bstring that is about to be unreferenced, just as forgetting to call free on a heap buffer that is about to be dereferenced. Though bstrlib itself is leak free. 2. Read before write usage: In C, declaring an auto bstring does not automatically fill it with legal/valid contents. This problem has been somewhat mitigated in C++. (The bstrDeclare and bstrFree macros from bstraux can be used to help mitigate this problem.) Other problems not addressed: 3. Built-in mutex usage to automatically avoid all bstring internal race conditions in multitasking environments: The problem with trying to implement such things at this low a level is that it is typically more efficient to use locks in higher level primitives. There is also no platform independent way to implement locks or mutexes. Note that except for spotty support of wide characters, the default C standard library does not address any of these problems either. Configurable compilation options -------------------------------- The Better String Library is not an application, it is a library. To compile it, you need to compile bstrlib.c to an object file that is linked to your application. A Makefile might contain entries such as the following to accomplish this: BSTRDIR = $(CDIR)/bstrlib INCLUDES = -I$(BSTRDIR) BSTROBJS = $(ODIR)/bstrlib.o DEFINES = CFLAGS = -O3 -Wall -pedantic -ansi -s $(DEFINES) application: $(ODIR)/main.o $(BSTROBJS) echo Linking: $@ $(CC) $< $(BSTROBJS) -o $@ $(ODIR)/%.o : $(BSTRDIR)/%.c echo Compiling: $< $(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@ $(ODIR)/%.o : %.c echo Compiling: $< $(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@ You can configure bstrlib using with the standard macro defines passed to the compiler. All configuration options are meant solely for the purpose of compiler compatibility. Configuration options are not meant to change the semantics or capabilities of the library, except where it is unavoidable. Since some C++ compilers don't include the Standard Template Library and some have the options of disabling exception handling, a number of macros can be used to conditionally compile support for each of this: BSTRLIB_CAN_USE_STL - defining this will enable the used of the Standard Template Library. Defining BSTRLIB_CAN_USE_STL overrides the BSTRLIB_CANNOT_USE_STL macro. BSTRLIB_CANNOT_USE_STL - defining this will disable the use of the Standard Template Library. Defining BSTRLIB_CAN_USE_STL overrides the BSTRLIB_CANNOT_USE_STL macro. BSTRLIB_CAN_USE_IOSTREAM - defining this will enable the used of streams from class std. Defining BSTRLIB_CAN_USE_IOSTREAM overrides the BSTRLIB_CANNOT_USE_IOSTREAM macro. BSTRLIB_CANNOT_USE_IOSTREAM - defining this will disable the use of streams from class std. Defining BSTRLIB_CAN_USE_IOSTREAM overrides the BSTRLIB_CANNOT_USE_IOSTREAM macro. BSTRLIB_THROWS_EXCEPTIONS - defining this will enable the exception handling within bstring. Defining BSTRLIB_THROWS_EXCEPTIONS overrides the BSTRLIB_DOESNT_THROWS_EXCEPTIONS macro. BSTRLIB_DOESNT_THROW_EXCEPTIONS - defining this will disable the exception handling within bstring. Defining BSTRLIB_THROWS_EXCEPTIONS overrides the BSTRLIB_DOESNT_THROW_EXCEPTIONS macro. Note that these macros must be defined consistently throughout all modules that use CBStrings including bstrwrap.cpp. Some older C compilers do not support functions such as vsnprintf. This is handled by the following macro variables: BSTRLIB_NOVSNP - defining this indicates that the compiler does not support vsnprintf. This will cause bformat and bformata to not be declared. Note that for some compilers, such as Turbo C, this is set automatically. Defining BSTRLIB_NOVSNP overrides the BSTRLIB_VSNP_OK macro. BSTRLIB_VSNP_OK - defining this will disable the autodetection of compilers that do not vsnprintf. Defining BSTRLIB_NOVSNP overrides the BSTRLIB_VSNP_OK macro. Semantic compilation options ---------------------------- Bstrlib comes with very few compilation options for changing the semantics of of the library. These are described below. BSTRLIB_DONT_ASSUME_NAMESPACE - Defining this before including bstrwrap.h will disable the automatic enabling of the Bstrlib namespace for the C++ declarations. BSTRLIB_DONT_USE_VIRTUAL_DESTRUCTOR - Defining this will make the CBString destructor non-virtual. BSTRLIB_MEMORY_DEBUG - Defining this will cause the bstrlib modules bstrlib.c and bstrwrap.cpp to invoke a #include "memdbg.h". memdbg.h has to be supplied by the user. Note that these macros must be defined consistently throughout all modules that use bstrings or CBStrings including bstrlib.c, bstraux.c and bstrwrap.cpp. =============================================================================== Files ----- Core C files (required for C and C++): bstrlib.c - C implementaion of bstring functions. bstrlib.h - C header file for bstring functions. Core C++ files (required for C++): bstrwrap.cpp - C++ implementation of CBString. bstrwrap.h - C++ header file for CBString. Base Unicode support: utf8util.c - C implemention of generic utf8 parsing functions. utf8util.h - C head file for generic utf8 parsing functions. buniutil.c - C implemention utf8 bstring packing and unpacking functions. buniutil.c - C header file for utf8 bstring functions. Extra utility functions: bstraux.c - C example that implements trivial additional functions. bstraux.h - C header for bstraux.c Miscellaneous: bstest.c - C unit/regression test for bstrlib.c test.cpp - C++ unit/regression test for bstrwrap.cpp bsafe.c - C runtime stubs to abort usage of unsafe C functions. bsafe.h - C header file for bsafe.c functions. C modules need only include bstrlib.h and compile/link bstrlib.c to use the basic bstring library. C++ projects need to additionally include bstrwrap.h and compile/link bstrwrap.cpp. For both, there may be a need to make choices about feature configuration as described in the "Configurable compilation options" in the section above. Other files that are included in this archive are: license.txt - The BSD license for Bstrlib gpl.txt - The GPL version 2 security.txt - A security statement useful for auditting Bstrlib porting.txt - A guide to porting Bstrlib bstrlib.txt - This file =============================================================================== The functions ------------- extern bstring bfromcstr (const char * str); Take a standard C library style '\0' terminated char buffer and generate a bstring with the same contents as the char buffer. If an error occurs NULL is returned. So for example: bstring b = bfromcstr ("Hello"); if (!b) { fprintf (stderr, "Out of memory"); } else { puts ((char *) b->data); } .......................................................................... extern bstring bfromcstralloc (int mlen, const char * str); Create a bstring which contains the contents of the '\0' terminated char * buffer str. The memory buffer backing the bstring is at least mlen characters in length. The buffer is also at least size required to hold the string with the '\0' terminator. If an error occurs NULL is returned. So for example: bstring b = bfromcstralloc (64, someCstr); if (b) b->data[63] = 'x'; The idea is that this will set the 64th character of b to 'x' if it is at least 64 characters long otherwise do nothing. And we know this is well defined so long as b was successfully created, since it will have been allocated with at least 64 characters. .......................................................................... extern bstring bfromcstrrangealloc (int minl, int maxl, const char* str); Create a bstring which contains the contents of the '\0' terminated char * buffer str. The memory buffer backing the string is at least minl characters in length, but an attempt is made to allocate up to maxl characters. The buffer is also at least size required to hold the string with the '\0' terminator. If an error occurs NULL is returned. So for example: bstring b = bfromcstrrangealloc (0, 128, "Hello."); if (b) b->data[5] = '!'; The idea is that this will set the 6th character of b to '!' if it was allocated otherwise do nothing. And we know this is well defined so long as b was successfully created, since it will have been allocated with at least 7 (strlen("Hello.")) characters. .......................................................................... extern bstring blk2bstr (const void * blk, int len); Create a bstring whose contents are described by the contiguous buffer pointing to by blk with a length of len bytes. Note that this function creates a copy of the data in blk, rather than simply referencing it. Compare with the blk2tbstr macro. If an error occurs NULL is returned. .......................................................................... extern char * bstr2cstr (const_bstring s, char z); Create a '\0' terminated char buffer which contains the contents of the bstring s, except that any contained '\0' characters are converted to the character in z. This returned value should be freed with bcstrfree(), by the caller. If an error occurs NULL is returned. .......................................................................... extern int bcstrfree (char * s); Frees a C-string generated by bstr2cstr (). This is normally unnecessary since it just wraps a call to free (), however, if malloc () and free () have been redefined as a macros within the bstrlib module (via macros in the memdbg.h backdoor) with some difference in behaviour from the std library functions, then this allows a correct way of freeing the memory that allows higher level code to be independent from these macro redefinitions. .......................................................................... extern bstring bstrcpy (const_bstring b1); Make a copy of the passed in bstring. The copied bstring is returned if there is no error, otherwise NULL is returned. .......................................................................... extern int bassign (bstring a, const_bstring b); Overwrite the bstring a with the contents of bstring b. Note that the bstring a must be a well defined and writable bstring. If an error occurs BSTR_ERR is returned and a is not overwritten. .......................................................................... int bassigncstr (bstring a, const char * str); Overwrite the string a with the contents of char * string str. Note that the bstring a must be a well defined and writable bstring. If an error occurs BSTR_ERR is returned and a may be partially overwritten. .......................................................................... int bassignblk (bstring a, const void * s, int len); Overwrite the string a with the contents of the block (s, len). Note that the bstring a must be a well defined and writable bstring. If an error occurs BSTR_ERR is returned and a is not overwritten. .......................................................................... extern int bassignmidstr (bstring a, const_bstring b, int left, int len); Overwrite the bstring a with the middle of contents of bstring b starting from position left and running for a length len. left and len are clamped to the ends of b as with the function bmidstr. Note that the bstring a must be a well defined and writable bstring. If an error occurs BSTR_ERR is returned and a is not overwritten. .......................................................................... extern bstring bmidstr (const_bstring b, int left, int len); Create a bstring which is the substring of b starting from position left and running for a length len (clamped by the end of the bstring b.) If there was no error, the value of this constructed bstring is returned otherwise NULL is returned. .......................................................................... extern int bdelete (bstring s1, int pos, int len); Removes characters from pos to pos+len-1 and shifts the tail of the bstring starting from pos+len to pos. len must be positive for this call to have any effect. The section of the bstring described by (pos, len) is clamped to boundaries of the bstring b. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int bconcat (bstring b0, const_bstring b1); Concatenate the bstring b1 to the end of bstring b0. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int bconchar (bstring b, char c); Concatenate the character c to the end of bstring b. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int bcatcstr (bstring b, const char * s); Concatenate the char * string s to the end of bstring b. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int bcatblk (bstring b, const void * s, int len); Concatenate a fixed length buffer (s, len) to the end of bstring b. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int biseq (const_bstring b0, const_bstring b1); Compare the bstring b0 and b1 for equality. If the bstrings differ, 0 is returned, if the bstrings are the same, 1 is returned, if there is an error, -1 is returned. If the length of the bstrings are different, this function has O(1) complexity. Contained '\0' characters are not treated as a termination character. Note that the semantics of biseq are not completely compatible with bstrcmp because of its different treatment of the '\0' character. .......................................................................... extern int bisstemeqblk (const_bstring b, const void * blk, int len); Compare beginning of bstring b0 with a block of memory of length len for equality. If the beginning of b0 differs from the memory block (or if b0 is too short), 0 is returned, if the bstrings are the same, 1 is returned, if there is an error, -1 is returned. .......................................................................... extern int biseqcaseless (const_bstring b0, const_bstring b1); Compare two bstrings for equality without differentiating between case. If the bstrings differ other than in case, 0 is returned, if the bstrings are the same, 1 is returned, if there is an error, -1 is returned. If the length of the bstrings are different, this function is O(1). '\0' termination characters are not treated in any special way. .......................................................................... extern int biseqcaselessblk (const_bstring b, const void * blk, int len); Compare content of b and the array of bytes in blk for length len for equality without differentiating between character case. If the content differs other than in case, 0 is returned, if, ignoring case, the content is the same, 1 is returned, if there is an error, -1 is returned. If the length of the strings are different, this function is O(1). '\0' termination characters are not treated in any special way. .......................................................................... extern int bisstemeqcaselessblk (const_bstring b0, const void * blk, int len); Compare beginning of bstring b0 with a block of memory of length len without differentiating between case for equality. If the beginning of b0 differs from the memory block other than in case (or if b0 is too short), 0 is returned, if the bstrings are the same, 1 is returned, if there is an error, -1 is returned. .......................................................................... int biseqblk (const_bstring b, const void * blk, int len) Compare the string b with the character block blk of length len. If the content differs, 0 is returned, if the content is the same, 1 is returned, if there is an error, -1 is returned. If the length of the strings are different, this function is O(1). '\0' characters are not treated in any special way. .......................................................................... extern int biseqcstr (const_bstring b, const char *s); Compare the bstring b and char * bstring s. The C string s must be '\0' terminated at exactly the length of the bstring b, and the contents between the two must be identical with the bstring b with no '\0' characters for the two contents to be considered equal. This is equivalent to the condition that their current contents will be always be equal when comparing them in the same format after converting one or the other. If they are equal 1 is returned, if they are unequal 0 is returned and if there is a detectable error BSTR_ERR is returned. .......................................................................... extern int biseqcstrcaseless (const_bstring b, const char *s); Compare the bstring b and char * string s. The C string s must be '\0' terminated at exactly the length of the bstring b, and the contents between the two must be identical except for case with the bstring b with no '\0' characters for the two contents to be considered equal. This is equivalent to the condition that their current contents will be always be equal ignoring case when comparing them in the same format after converting one or the other. If they are equal, except for case, 1 is returned, if they are unequal regardless of case 0 is returned and if there is a detectable error BSTR_ERR is returned. .......................................................................... extern int bstrcmp (const_bstring b0, const_bstring b1); Compare the bstrings b0 and b1 for ordering. If there is an error, SHRT_MIN is returned, otherwise a value less than or greater than zero, indicating that the bstring pointed to by b0 is lexicographically less than or greater than the bstring pointed to by b1 is returned. If the bstring lengths are unequal but the characters up until the length of the shorter are equal then a value less than, or greater than zero, indicating that the bstring pointed to by b0 is shorter or longer than the bstring pointed to by b1 is returned. 0 is returned if and only if the two bstrings are the same. If the length of the bstrings are different, this function is O(n). Like its standard C library counter part, the comparison does not proceed past any '\0' termination characters encountered. The seemingly odd error return value, merely provides slightly more granularity than the undefined situation given in the C library function strcmp. The function otherwise behaves very much like strcmp(). Note that the semantics of bstrcmp are not completely compatible with biseq because of its different treatment of the '\0' termination character. .......................................................................... extern int bstrncmp (const_bstring b0, const_bstring b1, int n); Compare the bstrings b0 and b1 for ordering for at most n characters. If there is an error, SHRT_MIN is returned, otherwise a value is returned as if b0 and b1 were first truncated to at most n characters then bstrcmp was called with these new bstrings are paremeters. If the length of the bstrings are different, this function is O(n). Like its standard C library counter part, the comparison does not proceed past any '\0' termination characters encountered. The seemingly odd error return value, merely provides slightly more granularity than the undefined situation given in the C library function strncmp. The function otherwise behaves very much like strncmp(). .......................................................................... extern int bstricmp (const_bstring b0, const_bstring b1); Compare two bstrings without differentiating between case. The return value is the difference of the values of the characters where the two bstrings first differ, otherwise 0 is returned indicating that the bstrings are equal. If the lengths are different, then a difference from 0 is given, but if the first extra character is '\0', then it is taken to be the value UCHAR_MAX+1. .......................................................................... extern int bstrnicmp (const_bstring b0, const_bstring b1, int n); Compare two bstrings without differentiating between case for at most n characters. If the position where the two bstrings first differ is before the nth position, the return value is the difference of the values of the characters, otherwise 0 is returned. If the lengths are different and less than n characters, then a difference from 0 is given, but if the first extra character is '\0', then it is taken to be the value UCHAR_MAX+1. .......................................................................... extern int bdestroy (bstring b); Deallocate the bstring passed. Passing NULL in as a parameter will have no effect. Note that both the header and the data portion of the bstring will be freed. No other bstring function which modifies one of its parameters will free or reallocate the header. Because of this, in general, bdestroy cannot be called on any declared struct tagbstring even if it is not write protected. A bstring which is write protected cannot be destroyed via the bdestroy call. Any attempt to do so will result in no action taken, and BSTR_ERR will be returned. Note to C++ users: Passing in a CBString cast to a bstring will lead to undefined behavior (free will be called on the header, rather than the CBString destructor.) Instead just use the ordinary C++ language facilities to dealloc a CBString. .......................................................................... extern int binstr (const_bstring s1, int pos, const_bstring s2); Search for the bstring s2 in s1 starting at position pos and looking in a forward (increasing) direction. If it is found then it returns with the first position after pos where it is found, otherwise it returns BSTR_ERR. The algorithm used is brute force; O(m*n). .......................................................................... extern int binstrr (const_bstring s1, int pos, const_bstring s2); Search for the bstring s2 in s1 starting at position pos and looking in a backward (decreasing) direction. If it is found then it returns with the first position after pos where it is found, otherwise return BSTR_ERR. Note that the current position at pos is tested as well -- so to be disjoint from a previous forward search it is recommended that the position be backed up (decremented) by one position. The algorithm used is brute force; O(m*n). .......................................................................... extern int binstrcaseless (const_bstring s1, int pos, const_bstring s2); Search for the bstring s2 in s1 starting at position pos and looking in a forward (increasing) direction but without regard to case. If it is found then it returns with the first position after pos where it is found, otherwise it returns BSTR_ERR. The algorithm used is brute force; O(m*n). .......................................................................... extern int binstrrcaseless (const_bstring s1, int pos, const_bstring s2); Search for the bstring s2 in s1 starting at position pos and looking in a backward (decreasing) direction but without regard to case. If it is found then it returns with the first position after pos where it is found, otherwise return BSTR_ERR. Note that the current position at pos is tested as well -- so to be disjoint from a previous forward search it is recommended that the position be backed up (decremented) by one position. The algorithm used is brute force; O(m*n). .......................................................................... extern int binchr (const_bstring b0, int pos, const_bstring b1); Search for the first position in b0 starting from pos or after, in which one of the characters in b1 is found. This function has an execution time of O(b0->slen + b1->slen). If such a position does not exist in b0, then BSTR_ERR is returned. .......................................................................... extern int binchrr (const_bstring b0, int pos, const_bstring b1); Search for the last position in b0 no greater than pos, in which one of the characters in b1 is found. This function has an execution time of O(b0->slen + b1->slen). If such a position does not exist in b0, then BSTR_ERR is returned. .......................................................................... extern int bninchr (const_bstring b0, int pos, const_bstring b1); Search for the first position in b0 starting from pos or after, in which none of the characters in b1 is found and return it. This function has an execution time of O(b0->slen + b1->slen). If such a position does not exist in b0, then BSTR_ERR is returned. .......................................................................... extern int bninchrr (const_bstring b0, int pos, const_bstring b1); Search for the last position in b0 no greater than pos, in which none of the characters in b1 is found and return it. This function has an execution time of O(b0->slen + b1->slen). If such a position does not exist in b0, then BSTR_ERR is returned. .......................................................................... extern int bstrchr (const_bstring b, int c); Search for the character c in the bstring b forwards from the start of the bstring. Returns the position of the found character or BSTR_ERR if it is not found. NOTE: This has been implemented as a macro on top of bstrchrp (). .......................................................................... extern int bstrrchr (const_bstring b, int c); Search for the character c in the bstring b backwards from the end of the bstring. Returns the position of the found character or BSTR_ERR if it is not found. NOTE: This has been implemented as a macro on top of bstrrchrp (). .......................................................................... extern int bstrchrp (const_bstring b, int c, int pos); Search for the character c in b forwards from the position pos (inclusive). Returns the position of the found character or BSTR_ERR if it is not found. .......................................................................... extern int bstrrchrp (const_bstring b, int c, int pos); Search for the character c in b backwards from the position pos in bstring (inclusive). Returns the position of the found character or BSTR_ERR if it is not found. .......................................................................... extern int bsetstr (bstring b0, int pos, const_bstring b1, unsigned char fill); Overwrite the bstring b0 starting at position pos with the bstring b1. If the position pos is past the end of b0, then the character "fill" is appended as necessary to make up the gap between the end of b0 and pos. If b1 is NULL, it behaves as if it were a 0-length bstring. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int binsert (bstring s1, int pos, const_bstring s2, unsigned char fill); Inserts the bstring s2 into s1 at position pos. If the position pos is past the end of s1, then the character "fill" is appended as necessary to make up the gap between the end of s1 and pos. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... int binsertblk (bstring b, int pos, const void * blk, int len, unsigned char fill) Inserts the block of characters at blk with length len into b at position pos. If the position pos is past the end of b, then the character "fill" is appended as necessary to make up the gap between the end of b1 and pos. Unlike bsetstr, binsert does not allow b2 to be NULL. .......................................................................... extern int binsertch (bstring s1, int pos, int len, unsigned char fill); Inserts the character fill repeatedly into s1 at position pos for a length len. If the position pos is past the end of s1, then the character "fill" is appended as necessary to make up the gap between the end of s1 and the position pos + len (exclusive). The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... extern int breplace (bstring b1, int pos, int len, const_bstring b2, unsigned char fill); Replace a section of a bstring from pos for a length len with the bstring b2. If the position pos is past the end of b1 then the character "fill" is appended as necessary to make up the gap between the end of b1 and pos. .......................................................................... extern int bfindreplace (bstring b, const_bstring find, const_bstring replace, int position); Replace all occurrences of the find substring with a replace bstring after a given position in the bstring b. The find bstring must have a length > 0 otherwise BSTR_ERR is returned. This function does not perform recursive per character replacement; that is to say successive searches resume at the position after the last replace. So for example: bfindreplace (a0 = bfromcstr("aabaAb"), a1 = bfromcstr("a"), a2 = bfromcstr("aa"), 0); Should result in changing a0 to "aaaabaaAb". This function performs exactly (b->slen - position) bstring comparisons, and data movement is bounded above by character volume equivalent to size of the output bstring. .......................................................................... extern int bfindreplacecaseless (bstring b, const_bstring find, const_bstring replace, int position); Replace all occurrences of the find substring, ignoring case, with a replace bstring after a given position in the bstring b. The find bstring must have a length > 0 otherwise BSTR_ERR is returned. This function does not perform recursive per character replacement; that is to say successive searches resume at the position after the last replace. So for example: bfindreplacecaseless (a0 = bfromcstr("AAbaAb"), a1 = bfromcstr("a"), a2 = bfromcstr("aa"), 0); Should result in changing a0 to "aaaabaaaab". This function performs exactly (b->slen - position) bstring comparisons, and data movement is bounded above by character volume equivalent to size of the output bstring. .......................................................................... extern int balloc (bstring b, int length); Increase the allocated memory backing the data buffer for the bstring b to a length of at least length. If the memory backing the bstring b is already large enough, not action is performed. This has no effect on the bstring b that is visible to the bstring API. Usually this function will only be used when a minimum buffer size is required coupled with a direct access to the ->data member of the bstring structure. Be warned that like any other bstring function, the bstring must be well defined upon entry to this function. I.e., doing something like: b->slen *= 2; /* ?? Most likely incorrect */ balloc (b, b->slen); is invalid, and should be implemented as: int t; if (BSTR_OK == balloc (b, t = (b->slen * 2))) b->slen = t; This function will return with BSTR_ERR if b is not detected as a valid bstring or length is not greater than 0, otherwise BSTR_OK is returned. .......................................................................... extern int ballocmin (bstring b, int length); Change the amount of memory backing the bstring b to at least length. This operation will never truncate the bstring data including the extra terminating '\0' and thus will not decrease the length to less than b->slen + 1. Note that repeated use of this function may cause performance problems (realloc may be called on the bstring more than the O(log(INT_MAX)) times). This function will return with BSTR_ERR if b is not detected as a valid bstring or length is not greater than 0, otherwise BSTR_OK is returned. So for example: if (BSTR_OK == ballocmin (b, 64)) b->data[63] = 'x'; The idea is that this will set the 64th character of b to 'x' if it is at least 64 characters long otherwise do nothing. And we know this is well defined so long as the ballocmin call was successfully, since it will ensure that b has been allocated with at least 64 characters. .......................................................................... int btrunc (bstring b, int n); Truncate the bstring to at most n characters. This function will return with BSTR_ERR if b is not detected as a valid bstring or n is less than 0, otherwise BSTR_OK is returned. .......................................................................... extern int bpattern (bstring b, int len); Replicate the starting bstring, b, end to end repeatedly until it surpasses len characters, then chop the result to exactly len characters. This function operates in-place. This function will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned. .......................................................................... extern int btoupper (bstring b); Convert contents of bstring to upper case. This function will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned. .......................................................................... extern int btolower (bstring b); Convert contents of bstring to lower case. This function will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned. .......................................................................... extern int bltrimws (bstring b); Delete whitespace contiguous from the left end of the bstring. This function will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned. .......................................................................... extern int brtrimws (bstring b); Delete whitespace contiguous from the right end of the bstring. This function will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned. .......................................................................... extern int btrimws (bstring b); Delete whitespace contiguous from both ends of the bstring. This function will return with BSTR_ERR if b is NULL or of length 0, otherwise BSTR_OK is returned. .......................................................................... extern struct bstrList* bstrListCreate (void); Create an empty struct bstrList. The struct bstrList output structure is declared as follows: struct bstrList { int qty, mlen; bstring * entry; }; The entry field actually is an array with qty number entries. The mlen record counts the maximum number of bstring's for which there is memory in the entry record. The Bstrlib API does *NOT* include a comprehensive set of functions for full management of struct bstrList in an abstracted way. The reason for this is because aliasing semantics of the list are best left to the user of this function, and performance varies wildly depending on the assumptions made. For a complete list of bstring data type it is recommended that the C++ public std::vector be used, since its semantics are usage are more standard. .......................................................................... extern int bstrListDestroy (struct bstrList * sl); Destroy a struct bstrList structure that was returned by the bsplit function. Note that this will destroy each bstring in the ->entry array as well. See bstrListCreate() above for structure of struct bstrList. .......................................................................... extern int bstrListAlloc (struct bstrList * sl, int msz); Ensure that there is memory for at least msz number of entries for the list. .......................................................................... extern int bstrListAllocMin (struct bstrList * sl, int msz); Try to allocate the minimum amount of memory for the list to include at least msz entries or sl->qty whichever is greater. .......................................................................... extern struct bstrList * bsplit (bstring str, unsigned char splitChar); Create an array of sequential substrings from str divided by the character splitChar. Successive occurrences of the splitChar will be divided by empty bstring entries, following the semantics from the Python programming language. To reclaim the memory from this output structure, bstrListDestroy () should be called. See bstrListCreate() above for structure of struct bstrList. .......................................................................... extern struct bstrList * bsplits (bstring str, const_bstring splitStr); Create an array of sequential substrings from str divided by any character contained in splitStr. An empty splitStr causes a single entry bstrList containing a copy of str to be returned. See bstrListCreate() above for structure of struct bstrList. .......................................................................... extern struct bstrList * bsplitstr (bstring str, const_bstring splitStr); Create an array of sequential substrings from str divided by the entire substring splitStr. An empty splitStr causes a single entry bstrList containing a copy of str to be returned. See bstrListCreate() above for structure of struct bstrList. .......................................................................... extern bstring bjoin (const struct bstrList * bl, const_bstring sep); Join the entries of a bstrList into one bstring by sequentially concatenating them with the sep bstring in between. If sep is NULL, it is treated as if it were the empty bstring. Note that: bjoin (l = bsplit (b, s->data[0]), s); should result in a copy of b, if s->slen is 1. If there is an error NULL is returned, otherwise a bstring with the correct result is returned. See bstrListCreate() above for structure of struct bstrList. .......................................................................... bstring bjoinblk (const struct bstrList * bl, void * blk, int len); Join the entries of a bstrList into one bstring by sequentially concatenating them with the content from blk for length len in between. If there is an error NULL is returned, otherwise a bstring with the correct result is returned. .......................................................................... extern int bsplitcb (const_bstring str, unsigned char splitChar, int pos, int (* cb) (void * parm, int ofs, int len), void * parm); Iterate the set of disjoint sequential substrings over str starting at position pos divided by the character splitChar. The parm passed to bsplitcb is passed on to cb. If the function cb returns a value < 0, then further iterating is halted and this value is returned by bsplitcb. Note: Non-destructive modification of str from within the cb function while performing this split is not undefined. bsplitcb behaves in sequential lock step with calls to cb. I.e., after returning from a cb that return a non-negative integer, bsplitcb continues from the position 1 character after the last detected split character and it will halt immediately if the length of str falls below this point. However, if the cb function destroys str, then it *must* return with a negative value, otherwise bsplitcb will continue in an undefined manner. This function is provided as an incremental alternative to bsplit that is abortable and which does not impose additional memory allocation. .......................................................................... extern int bsplitscb (const_bstring str, const_bstring splitStr, int pos, int (* cb) (void * parm, int ofs, int len), void * parm); Iterate the set of disjoint sequential substrings over str starting at position pos divided by any of the characters in splitStr. An empty splitStr causes the whole str to be iterated once. The parm passed to bsplitcb is passed on to cb. If the function cb returns a value < 0, then further iterating is halted and this value is returned by bsplitcb. Note: Non-destructive modification of str from within the cb function while performing this split is not undefined. bsplitscb behaves in sequential lock step with calls to cb. I.e., after returning from a cb that return a non-negative integer, bsplitscb continues from the position 1 character after the last detected split character and it will halt immediately if the length of str falls below this point. However, if the cb function destroys str, then it *must* return with a negative value, otherwise bsplitscb will continue in an undefined manner. This function is provided as an incremental alternative to bsplits that is abortable and which does not impose additional memory allocation. .......................................................................... extern int bsplitstrcb (const_bstring str, const_bstring splitStr, int pos, int (* cb) (void * parm, int ofs, int len), void * parm); Iterate the set of disjoint sequential substrings over str starting at position pos divided by the entire substring splitStr. An empty splitStr causes each character of str to be iterated. The parm passed to bsplitcb is passed on to cb. If the function cb returns a value < 0, then further iterating is halted and this value is returned by bsplitcb. Note: Non-destructive modification of str from within the cb function while performing this split is not undefined. bsplitstrcb behaves in sequential lock step with calls to cb. I.e., after returning from a cb that return a non-negative integer, bsplitstrcb continues from the position 1 character after the last detected split character and it will halt immediately if the length of str falls below this point. However, if the cb function destroys str, then it *must* return with a negative value, otherwise bsplitscb will continue in an undefined manner. This function is provided as an incremental alternative to bsplitstr that is abortable and which does not impose additional memory allocation. .......................................................................... extern bstring bformat (const char * fmt, ...); Takes the same parameters as printf (), but rather than outputting results to stdio, it forms a bstring which contains what would have been output. Note that if there is an early generation of a '\0' character, the bstring will be truncated to this end point. Note that %s format tokens correspond to '\0' terminated char * buffers, not bstrings. To print a bstring, first dereference data element of the the bstring: /* b1->data needs to be '\0' terminated, so tagbstrings generated by blk2tbstr () might not be suitable. */ b0 = bformat ("Hello, %s", b1->data); Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been compiled the bformat function is not present. .......................................................................... extern int bformata (bstring b, const char * fmt, ...); In addition to the initial output buffer b, bformata takes the same parameters as printf (), but rather than outputting results to stdio, it appends the results to the initial bstring parameter. Note that if there is an early generation of a '\0' character, the bstring will be truncated to this end point. Note that %s format tokens correspond to '\0' terminated char * buffers, not bstrings. To print a bstring, first dereference data element of the the bstring: /* b1->data needs to be '\0' terminated, so tagbstrings generated by blk2tbstr () might not be suitable. */ bformata (b0 = bfromcstr ("Hello"), ", %s", b1->data); Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been compiled the bformata function is not present. .......................................................................... extern int bassignformat (bstring b, const char * fmt, ...); After the first parameter, it takes the same parameters as printf (), but rather than outputting results to stdio, it outputs the results to the bstring parameter b. Note that if there is an early generation of a '\0' character, the bstring will be truncated to this end point. Note that %s format tokens correspond to '\0' terminated char * buffers, not bstrings. To print a bstring, first dereference data element of the the bstring: /* b1->data needs to be '\0' terminated, so tagbstrings generated by blk2tbstr () might not be suitable. */ bassignformat (b0 = bfromcstr ("Hello"), ", %s", b1->data); Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been compiled the bassignformat function is not present. .......................................................................... extern int bvcformata (bstring b, int count, const char * fmt, va_list arglist); The bvcformata function formats data under control of the format control string fmt and attempts to append the result to b. The fmt parameter is the same as that of the printf function. The variable argument list is replaced with arglist, which has been initialized by the va_start macro. The size of the output is upper bounded by count. If the required output exceeds count, the string b is not augmented with any contents and a value below BSTR_ERR is returned. If a value below -count is returned then it is recommended that the negative of this value be used as an update to the count in a subsequent pass. On other errors, such as running out of memory, parameter errors or numeric wrap around BSTR_ERR is returned. BSTR_OK is returned when the output is successfully generated and appended to b. Note: There is no sanity checking of arglist, and this function is destructive of the contents of b from the b->slen point onward. If there is an early generation of a '\0' character, the bstring will be truncated to this end point. Although this function is part of the external API for Bstrlib, the interface and semantics (length limitations, and unusual return codes) are fairly atypical. The real purpose for this function is to provide an engine for the bvformata macro. Note that if the BSTRLIB_NOVSNP macro has been set when bstrlib has been compiled the bvcformata function is not present. .......................................................................... extern bstring bread (bNread readPtr, void * parm); typedef size_t (* bNread) (void *buff, size_t elsize, size_t nelem, void *parm); Read an entire stream into a bstring, verbatum. The readPtr function pointer is compatible with fread sematics, except that it need not obtain the stream data from a file. The intention is that parm would contain the stream data context/state required (similar to the role of the FILE* I/O stream parameter of fread.) Abstracting the block read function allows for block devices other than file streams to be read if desired. Note that there is an ANSI compatibility issue if "fread" is used directly; see the ANSI issues section below. .......................................................................... extern int breada (bstring b, bNread readPtr, void * parm); Read an entire stream and append it to a bstring, verbatum. Behaves like bread, except that it appends it results to the bstring b. BSTR_ERR is returned on error, otherwise 0 is returned. .......................................................................... extern bstring bgets (bNgetc getcPtr, void * parm, char terminator); typedef int (* bNgetc) (void * parm); Read a bstring from a stream. As many bytes as is necessary are read until the terminator is consumed or no more characters are available from the stream. If read from the stream, the terminator character will be appended to the end of the returned bstring. The getcPtr function must have the same semantics as the fgetc C library function (i.e., returning an integer whose value is negative when there are no more characters available, otherwise the value of the next available unsigned character from the stream.) The intention is that parm would contain the stream data context/state required (similar to the role of the FILE* I/O stream parameter of fgets.) If no characters are read, or there is some other detectable error, NULL is returned. bgets will never call the getcPtr function more often than necessary to construct its output (including a single call, if required, to determine that the stream contains no more characters.) Abstracting the character stream function and terminator character allows for different stream devices and string formats other than '\n' terminated lines in a file if desired (consider \032 terminated email messages, in a UNIX mailbox for example.) For files, this function can be used analogously as fgets as follows: fp = fopen ( ... ); if (fp) b = bgets ((bNgetc) fgetc, fp, '\n'); (Note that only one terminator character can be used, and that '\0' is not assumed to terminate the stream in addition to the terminator character. This is consistent with the semantics of fgets.) .......................................................................... extern int bgetsa (bstring b, bNgetc getcPtr, void * parm, char terminator); Read from a stream and concatenate to a bstring. Behaves like bgets, except that it appends it results to the bstring b. The value 1 is returned if no characters are read before a negative result is returned from getcPtr. Otherwise BSTR_ERR is returned on error, and 0 is returned in other normal cases. .......................................................................... extern int bassigngets (bstring b, bNgetc getcPtr, void * parm, char terminator); Read from a stream and concatenate to a bstring. Behaves like bgets, except that it assigns the results to the bstring b. The value 1 is returned if no characters are read before a negative result is returned from getcPtr. Otherwise BSTR_ERR is returned on error, and 0 is returned in other normal cases. .......................................................................... extern struct bStream * bsopen (bNread readPtr, void * parm); Wrap a given open stream (described by a fread compatible function pointer and stream handle) into an open bStream suitable for the bstring library streaming functions. .......................................................................... extern void * bsclose (struct bStream * s); Close the bStream, and return the handle to the stream that was originally used to open the given stream. If s is NULL or detectably invalid, NULL will be returned. .......................................................................... extern int bsbufflength (struct bStream * s, int sz); Set the length of the buffer used by the bStream. If sz is the macro BSTR_BS_BUFF_LENGTH_GET (which is 0), the length is not set. If s is NULL or sz is negative, the function will return with BSTR_ERR, otherwise this function returns with the previous length. .......................................................................... extern int bsreadln (bstring r, struct bStream * s, char terminator); Read a bstring terminated by the terminator character or the end of the stream from the bStream (s) and return it into the parameter r. The matched terminator, if found, appears at the end of the line read. If the stream has been exhausted of all available data, before any can be read, BSTR_ERR is returned. This function may read additional characters into the stream buffer from the core stream that are not returned, but will be retained for subsequent read operations. When reading from high speed streams, this function can perform significantly faster than bgets. .......................................................................... extern int bsreadlna (bstring r, struct bStream * s, char terminator); Read a bstring terminated by the terminator character or the end of the stream from the bStream (s) and concatenate it to the parameter r. The matched terminator, if found, appears at the end of the line read. If the stream has been exhausted of all available data, before any can be read, BSTR_ERR is returned. This function may read additional characters into the stream buffer from the core stream that are not returned, but will be retained for subsequent read operations. When reading from high speed streams, this function can perform significantly faster than bgets. .......................................................................... extern int bsreadlns (bstring r, struct bStream * s, bstring terminators); Read a bstring terminated by any character in the terminators bstring or the end of the stream from the bStream (s) and return it into the parameter r. This function may read additional characters from the core stream that are not returned, but will be retained for subsequent read operations. .......................................................................... extern int bsreadlnsa (bstring r, struct bStream * s, bstring terminators); Read a bstring terminated by any character in the terminators bstring or the end of the stream from the bStream (s) and concatenate it to the parameter r. If the stream has been exhausted of all available data, before any can be read, BSTR_ERR is returned. This function may read additional characters from the core stream that are not returned, but will be retained for subsequent read operations. .......................................................................... extern int bsread (bstring r, struct bStream * s, int n); Read a bstring of length n (or, if it is fewer, as many bytes as is remaining) from the bStream. This function will read the minimum required number of additional characters from the core stream. When the stream is at the end of the file BSTR_ERR is returned, otherwise BSTR_OK is returned. .......................................................................... extern int bsreada (bstring r, struct bStream * s, int n); Read a bstring of length n (or, if it is fewer, as many bytes as is remaining) from the bStream and concatenate it to the parameter r. This function will read the minimum required number of additional characters from the core stream. When the stream is at the end of the file BSTR_ERR is returned, otherwise BSTR_OK is returned. .......................................................................... extern int bsunread (struct bStream * s, const_bstring b); Insert a bstring into the bStream at the current position. These characters will be read prior to those that actually come from the core stream. .......................................................................... extern int bspeek (bstring r, const struct bStream * s); Return the number of currently buffered characters from the bStream that will be read prior to reads from the core stream, and append it to the the parameter r. .......................................................................... extern int bssplitscb (struct bStream * s, const_bstring splitStr, int (* cb) (void * parm, int ofs, const_bstring entry), void * parm); Iterate the set of disjoint sequential substrings over the stream s divided by any character from the bstring splitStr. The parm passed to bssplitscb is passed on to cb. If the function cb returns a value < 0, then further iterating is halted and this return value is returned by bssplitscb. Note: At the point of calling the cb function, the bStream pointer is pointed exactly at the position right after having read the split character. The cb function can act on the stream by causing the bStream pointer to move, and bssplitscb will continue by starting the next split at the position of the pointer after the return from cb. However, if the cb causes the bStream s to be destroyed then the cb must return with a negative value, otherwise bssplitscb will continue in an undefined manner. This function is provided as way to incrementally parse through a file or other generic stream that in total size may otherwise exceed the practical or desired memory available. As with the other split callback based functions this is abortable and does not impose additional memory allocation. .......................................................................... extern int bssplitstrcb (struct bStream * s, const_bstring splitStr, int (* cb) (void * parm, int ofs, const_bstring entry), void * parm); Iterate the set of disjoint sequential substrings over the stream s divided by the entire substring splitStr. The parm passed to bssplitstrcb is passed on to cb. If the function cb returns a value < 0, then further iterating is halted and this return value is returned by bssplitstrcb. Note: At the point of calling the cb function, the bStream pointer is pointed exactly at the position right after having read the split character. The cb function can act on the stream by causing the bStream pointer to move, and bssplitstrcb will continue by starting the next split at the position of the pointer after the return from cb. However, if the cb causes the bStream s to be destroyed then the cb must return with a negative value, otherwise bssplitscb will continue in an undefined manner. This function is provided as way to incrementally parse through a file or other generic stream that in total size may otherwise exceed the practical or desired memory available. As with the other split callback based functions this is abortable and does not impose additional memory allocation. .......................................................................... extern int bseof (const struct bStream * s); Return the defacto "EOF" (end of file) state of a stream (1 if the bStream is in an EOF state, 0 if not, and BSTR_ERR if stream is closed or detectably erroneous.) When the readPtr callback returns a value <= 0 the stream reaches its "EOF" state. Note that bunread with non-empty content will essentially turn off this state, and the stream will not be in its "EOF" state so long as its possible to read more data out of it. Also note that the semantics of bseof() are slightly different from something like feof(). I.e., reaching the end of the stream does not necessarily guarantee that bseof() will return with a value indicating that this has happened. bseof() will only return indicating that it has reached the "EOF" and an attempt has been made to read past the end of the bStream. The macros ---------- The macros described below are shown in a prototype form indicating their intended usage. Note that the parameters passed to these macros will be referenced multiple times. As with all macros, programmer care is required to guard against unintended side effects. int blengthe (const_bstring b, int err); Returns the length of the bstring. If the bstring is NULL err is returned. .......................................................................... int blength (const_bstring b); Returns the length of the bstring. If the bstring is NULL, the length returned is 0. .......................................................................... int bchare (const_bstring b, int p, int c); Returns the p'th character of the bstring b. If the position p refers to a position that does not exist in the bstring or the bstring is NULL, then c is returned. .......................................................................... char bchar (const_bstring b, int p); Returns the p'th character of the bstring b. If the position p refers to a position that does not exist in the bstring or the bstring is NULL, then '\0' is returned. .......................................................................... char * bdatae (bstring b, char * err); Returns the char * data portion of the bstring b. If b is NULL, err is returned. .......................................................................... char * bdata (bstring b); Returns the char * data portion of the bstring b. If b is NULL, NULL is returned. .......................................................................... char * bdataofse (bstring b, int ofs, char * err); Returns the char * data portion of the bstring b offset by ofs. If b is NULL, err is returned. .......................................................................... char * bdataofs (bstring b, int ofs); Returns the char * data portion of the bstring b offset by ofs. If b is NULL, NULL is returned. .......................................................................... struct tagbstring var = bsStatic ("..."); The bsStatic macro allows for static declarations of literal string constants as struct tagbstring structures. The resulting tagbstring does not need to be freed or destroyed. Note that this macro is only well defined for string literal arguments. For more general string pointers, use the btfromcstr macro. The resulting struct tagbstring is permanently write protected. Attempts to write to this struct tagbstring from any bstrlib function will lead to BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct tagbstring has no effect. .......................................................................... <- bsStaticBlkParms ("...") The bsStaticBlkParms macro emits a pair of comma seperated parameters corresponding to the block parameters for the block functions in Bstrlib (i.e., blk2bstr, bcatblk, blk2tbstr, bisstemeqblk, bisstemeqcaselessblk.) Note that this macro is only well defined for string literal arguments. Examples: bstring b = blk2bstr (bsStaticBlkParms ("Fast init. ")); bcatblk (b, bsStaticBlkParms ("No frills fast concatenation.")); These are faster than using bfromcstr() and bcatcstr() respectively because the length of the inline string is known as a compile time constant. Also note that seperate struct tagbstring declarations for holding the output of a bsStatic() macro are not required. .......................................................................... void btfromcstr (struct tagbstring& t, const char * s); Fill in the tagbstring t with the '\0' terminated char buffer s. This action is purely reference oriented; no memory management is done. The data member is just assigned s, and slen is assigned the strlen of s. The s parameter is accessed exactly once in this macro. The resulting struct tagbstring is initially write protected. Attempts to write to this struct tagbstring in a write protected state from any bstrlib function will lead to BSTR_ERR being returned. Invoke the bwriteallow on this struct tagbstring to make it writeable (though this requires that s be obtained from a function compatible with malloc.) .......................................................................... void btfromblk (struct tagbstring& t, void * s, int len); Fill in the tagbstring t with the data buffer s with length len. This action is purely reference oriented; no memory management is done. The data member of t is just assigned s, and slen is assigned len. Note that the buffer is not appended with a '\0' character. The s and len parameters are accessed exactly once each in this macro. The resulting struct tagbstring is initially write protected. Attempts to write to this struct tagbstring in a write protected state from any bstrlib function will lead to BSTR_ERR being returned. Invoke the bwriteallow on this struct tagbstring to make it writeable (though this requires that s be obtained from a function compatible with malloc.) .......................................................................... void btfromblkltrimws (struct tagbstring& t, void * s, int len); Fill in the tagbstring t with the data buffer s with length len after it has been left trimmed. This action is purely reference oriented; no memory management is done. The data member of t is just assigned to a pointer inside the buffer s. Note that the buffer is not appended with a '\0' character. The s and len parameters are accessed exactly once each in this macro. The resulting struct tagbstring is permanently write protected. Attempts to write to this struct tagbstring from any bstrlib function will lead to BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct tagbstring has no effect. .......................................................................... void btfromblkrtrimws (struct tagbstring& t, void * s, int len); Fill in the tagbstring t with the data buffer s with length len after it has been right trimmed. This action is purely reference oriented; no memory management is done. The data member of t is just assigned to a pointer inside the buffer s. Note that the buffer is not appended with a '\0' character. The s and len parameters are accessed exactly once each in this macro. The resulting struct tagbstring is permanently write protected. Attempts to write to this struct tagbstring from any bstrlib function will lead to BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct tagbstring has no effect. .......................................................................... void btfromblktrimws (struct tagbstring& t, void * s, int len); Fill in the tagbstring t with the data buffer s with length len after it has been left and right trimmed. This action is purely reference oriented; no memory management is done. The data member of t is just assigned to a pointer inside the buffer s. Note that the buffer is not appended with a '\0' character. The s and len parameters are accessed exactly once each in this macro. The resulting struct tagbstring is permanently write protected. Attempts to write to this struct tagbstring from any bstrlib function will lead to BSTR_ERR being returned. Invoking the bwriteallow macro onto this struct tagbstring has no effect. .......................................................................... void bmid2tbstr (struct tagbstring& t, bstring b, int pos, int len); Fill the tagbstring t with the substring from b, starting from position pos with a length len. The segment is clamped by the boundaries of the bstring b. This action is purely reference oriented; no memory management is done. Note that the buffer is not appended with a '\0' character. Note that the t parameter to this macro may be accessed multiple times. Note that the contents of t will become undefined if the contents of b change or are destroyed. The resulting struct tagbstring is permanently write protected. Attempts to write to this struct tagbstring in a write protected state from any bstrlib function will lead to BSTR_ERR being returned. Invoking the bwriteallow macro on this struct tagbstring will have no effect. .......................................................................... bstring bfromStatic("..."); Allocate a bstring with the contents of a string literal. Returns NULL if an error has occurred (ran out of memory). The string literal parameter is enforced as literal at compile time. .......................................................................... int bcatStatic (bstring b, "..."); Append a string literal to bstring b. Returns 0 if successful, or BSTR_ERR if some error has occurred. The string literal parameter is enforced as literal at compile time. .......................................................................... int binsertStatic (bstring s1, int pos, " ... ", char fill); Inserts the string literal into s1 at position pos. If the position pos is past the end of s1, then the character "fill" is appended as necessary to make up the gap between the end of s1 and pos. The value BSTR_OK is returned if the operation is successful, otherwise BSTR_ERR is returned. .......................................................................... int bassignStatic (bstring b, " ... "); Assign the contents of a string literal to the bstring b. The string literal parameter is enforced as literal at compile time. .......................................................................... int biseqStatic (const_bstring b, " ... "); Compare the string b with the string literal. If the content differs, 0 is returned, if the content is the same, 1 is returned, if there is an error, -1 is returned. If the length of the strings are different, this function is O(1). '\0' characters are not treated in any special way. .......................................................................... int biseqcaselessStatic (const_bstring b, " ... "); Compare content of b and the string literal for equality without differentiating between character case. If the content differs other than in case, 0 is returned, if, ignoring case, the content is the same, 1 is returned, if there is an error, -1 is returned. If the length of the strings are different, this function is O(1). '\0' characters are not treated in any special way. .......................................................................... int bisstemeqStatic (bstring b, " ... "); Compare beginning of bstring b with a string literal for equality. If the beginning of b differs from the memory block (or if b is too short), 0 is returned, if the bstrings are the same, 1 is returned, if there is an error, -1 is returned. The string literal parameter is enforced as literal at compile time. .......................................................................... int bisstemeqcaselessStatic (bstring b, " ... "); Compare beginning of bstring b with a string literal without differentiating between case for equality. If the beginning of b differs from the memory block other than in case (or if b is too short), 0 is returned, if the bstrings are the same, 1 is returned, if there is an error, -1 is returned. The string literal parameter is enforced as literal at compile time. .......................................................................... bstring bjoinStatic (const struct bstrList * bl, " ... "); Join the entries of a bstrList into one bstring by sequentially concatenating them with the string literal in between. If there is an error NULL is returned, otherwise a bstring with the correct result is returned. See bstrListCreate() above for structure of struct bstrList. .......................................................................... void bvformata (int& ret, bstring b, const char * format, lastarg); Append the bstring b with printf like formatting with the format control string, and the arguments taken from the ... list of arguments after lastarg passed to the containing function. If the containing function does not have ... parameters or lastarg is not the last named parameter before the ... then the results are undefined. If successful, the results are appended to b and BSTR_OK is assigned to ret. Otherwise BSTR_ERR is assigned to ret. Example: void dbgerror (FILE * fp, const char * fmt, ...) { int ret; bstring b; bvformata (ret, b = bfromcstr ("DBG: "), fmt, fmt); if (BSTR_OK == ret) fputs ((char *) bdata (b), fp); bdestroy (b); } Note that if the BSTRLIB_NOVSNP macro was set when bstrlib had been compiled the bvformata macro will not link properly. If the BSTRLIB_NOVSNP macro has been set, the bvformata macro will not be available. .......................................................................... void bwriteprotect (struct tagbstring& t); Disallow bstring from being written to via the bstrlib API. Attempts to write to the resulting tagbstring from any bstrlib function will lead to BSTR_ERR being returned. Note: bstrings which are write protected cannot be destroyed via bdestroy. Note to C++ users: Setting a CBString as write protected will not prevent it from being destroyed by the destructor. .......................................................................... void bwriteallow (struct tagbstring& t); Allow bstring to be written to via the bstrlib API. Note that such an action makes the bstring both writable and destroyable. If the bstring is not legitimately writable (as is the case for struct tagbstrings initialized with a bsStatic value), the results of this are undefined. Note that invoking the bwriteallow macro may increase the number of reallocs by one more than necessary for every call to bwriteallow interleaved with any bstring API which writes to this bstring. .......................................................................... int biswriteprotected (struct tagbstring& t); Returns 1 if the bstring is write protected, otherwise 0 is returned. =============================================================================== Unicode functions ----------------- The two modules utf8util.c and buniutil.c implement basic functions for parsing and collecting Unicode data in the UTF8 format. Unicode is described by a sequence of "code points" which are values between 0 and 1114111 inclusive mapped to symbol content corresponding to nearly all the standardized scripts of the world. The semantics of Unicode code points is varied and complicated. The base support of the better string library does not attempt to perform any interpretation of these code points. The better string library solely provides support for iterating through unicode code points, appending and extracting code points to and from bstrings, and parsing UTF8 and UTF16 from raw data. The types cpUcs4 and cpUcs2 respectively are defined as 4 byte and 2 byte encoding formats corresponding to UCS4 and UCS2 respectively. To test if a raw code point is valid, the macro isLegalUnicodeCodePoint() has been defined. The utf8 iterator is defined by struct utf8Iterator. To test if the iterator has more code points to walk through the macro utf8IteratorNoMore() has been defined. To use these functions compile and link utf8util.c and buniutil.c .......................................................................... extern void utf8IteratorInit (struct utf8Iterator* iter, unsigned char* data, int slen); Initialize a unicode utf8 iterator to traverse an array of utf8 encoded code points pointed to by data, with length slen from the start. The iterator iter is only valid for as long as the array it is pointed to is valid and not modified. .......................................................................... extern void utf8IteratorUninit (struct utf8Iterator* iter); Invalidate utf8 iterator. After calling this the iterator iter, should yield false when passed to the utf8IteratorNoMore() macro. .......................................................................... extern cpUcs4 utf8IteratorGetNextCodePoint (struct utf8Iterator* iter, cpUcs4 errCh); Parse code point the iterator is pointing at and advance the iterator to the next code point. If the iterator was pointing at a valid code point the code point is returned, otherwise, errCh will be returned. .......................................................................... extern cpUcs4 utf8IteratorGetCurrCodePoint (struct utf8Iterator* iter, cpUcs4 errCh); Parse code point the iterator is pointing at. If the iterator was pointing at a valid code point the code point is returned, otherwise, errCh will be returned. .......................................................................... extern int utf8ScanBackwardsForCodePoint (unsigned char* msg, int len, int pos, cpUcs4* out); From the position "pos" in the array msg of length len, search for the last position before or at pos where from which a valid Unicode code point can be parsed. If such an offset is found it is returned otherwise a negative value is returned. The code point parsed is put into *out if it is not NULL. .......................................................................... extern int buIsUTF8Content (const_bstring bu); Scan a bstring and determine if it is made entirely of unicode code valid points. If it is, 1 is returned, otherwise 0 is returned. .......................................................................... extern int buAppendBlkUcs4 (bstring b, const cpUcs4* bu, int len, cpUcs4 errCh); Append the code points passed in the UCS4 format (raw numbers) in the array bu of length len. Any unparsable characters are replaced by errCh. If errCh is not a valid Unicode code point, then parsing errors will cause BSTR_ERR to be returned. .......................................................................... extern int buGetBlkUTF16 (cpUcs2* ucs2, int len, cpUcs4 errCh, const_bstring bu, int pos); Convert a string of UTF8 codepoints (bu), skipping the first pos, into a sequence of UTF16 encoded code points. Returns the number of UCS2 16-bit words written to the output. No more than len words are written to the target array ucs2. If any code point in bu is unparsable, it will be translated to errCh. .......................................................................... extern int buAppendBlkUTF16 (bstring bu, const cpUcs2* utf16, int len, cpUcs2* bom, cpUcs4 errCh); Append an array of UCS2 code points (utf16) to UTF8 codepoints (bu). Any invalid code point is replaced by errCh. If errCh is itself not a valid code point, then this translation will halt upon the first error and return BSTR_ERR. Otherwise BSTR_OK is returned. If a byte order mark has been previously read, it may be passed in as bom, otherwise if *bom is set to 0, it will be filled in with the BOM as read from the first character if it is a BOM. =============================================================================== The bstest module ----------------- The bstest module is just a unit test for the bstrlib module. For correct implementations of bstrlib, it should execute with 0 failures being reported. This test should be utilized if modifications/customizations to bstrlib have been performed. It tests each core bstrlib function with bstrings of every mode (read-only, NULL, static and mutable) and ensures that the expected semantics are observed (including results that should indicate an error). It also tests for aliasing support. Passing bstest is a necessary but not a sufficient condition for ensuring the correctness of the bstrlib module. The test module --------------- The test module is just a unit test for the bstrwrap module. For correct implementations of bstrwrap, it should execute with 0 failures being reported. This test should be utilized if modifications/customizations to bstrwrap have been performed. It tests each core bstrwrap function with CBStrings write protected or not and ensures that the expected semantics are observed (including expected exceptions.) Note that exceptions cannot be disabled to run this test. Passing test is a necessary but not a sufficient condition for ensuring the correctness of the bstrwrap module. =============================================================================== Using Bstring and CBString as an alternative to the C library ------------------------------------------------------------- First let us give a table of C library functions and the alternative bstring functions and CBString methods that should be used instead of them. C-library Bstring alternative CBString alternative --------- ------------------- -------------------- gets bgets ::gets strcpy bassign = operator strncpy bassignmidstr ::midstr strcat bconcat += operator strncat bconcat + btrunc += operator + ::trunc strtok bsplit, bsplits ::split sprintf b(assign)format ::format snprintf b(assign)format + btrunc ::format + ::trunc vsprintf bvformata bvformata vsnprintf bvformata + btrunc bvformata + btrunc vfprintf bvformata + fputs use bvformata + fputs strcmp biseq, bstrcmp comparison operators. strncmp bstrncmp, memcmp bstrncmp, memcmp strlen ->slen, blength ::length strdup bstrcpy constructor strset bpattern ::fill strstr binstr ::find strpbrk binchr ::findchr stricmp bstricmp cast & use bstricmp strlwr btolower cast & use btolower strupr btoupper cast & use btoupper strrev bReverse (aux module) cast & use bReverse strchr bstrchr cast & use bstrchr strspnp use strspn use strspn ungetc bsunread bsunread The top 9 C functions listed here are troublesome in that they impose memory management in the calling function. The Bstring and CBstring interfaces have built-in memory management, so there is far less code with far less potential for buffer overrun problems. strtok can only be reliably called as a "leaf" calculation, since it (quite bizarrely) maintains hidden internal state. And gets is well known to be broken no matter what. The Bstrlib alternatives do not suffer from those sorts of problems. The substitute for strncat can be performed with higher performance by using the blk2tbstr macro to create a presized second operand for bconcat. C-library Bstring alternative CBString alternative --------- ------------------- -------------------- strspn strspn acceptable strspn acceptable strcspn strcspn acceptable strcspn acceptable strnset strnset acceptable strnset acceptable printf printf acceptable printf acceptable puts puts acceptable puts acceptable fprintf fprintf acceptable fprintf acceptable fputs fputs acceptable fputs acceptable memcmp memcmp acceptable memcmp acceptable Remember that Bstring (and CBstring) functions will automatically append the '\0' character to the character data buffer. So by simply accessing the data buffer directly, ordinary C string library functions can be called directly on them. Note that bstrcmp is not the same as memcmp in exactly the same way that strcmp is not the same as memcmp. C-library Bstring alternative CBString alternative --------- ------------------- -------------------- fread balloc + fread ::alloc + fread fgets balloc + fgets ::alloc + fgets These are odd ones because of the exact sizing of the buffer required. The Bstring and CBString alternatives requires that the buffers are forced to hold at least the prescribed length, then just use fread or fgets directly. However, typically the automatic memory management of Bstring and CBstring will make the typical use of fgets and fread to read specifically sized strings unnecessary. Implementation Choices ---------------------- Overhead: ......... The bstring library has more overhead versus straight char buffers for most functions. This overhead is essentially just the memory management and string header allocation. This overhead usually only shows up for small string manipulations. The performance loss has to be considered in light of the following: 1) What would be the performance loss of trying to write this management code in one's own application? 2) Since the bstring library source code is given, a sufficiently powerful modern inlining globally optimizing compiler can remove function call overhead. Since the data type is exposed, a developer can replace any unsatisfactory function with their own inline implementation. And that is besides the main point of what the better string library is mainly meant to provide. Any overhead lost has to be compared against the value of the safe abstraction for coupling memory management and string functionality. Performance of the C interface: ............................... The algorithms used have performance advantages versus the analogous C library functions. For example: 1. bfromcstr/blk2str/bstrcpy versus strcpy/strdup. By using memmove instead of strcpy, the break condition of the copy loop is based on an independent counter (that should be allocated in a register) rather than having to check the results of the load. Modern out-of-order executing CPUs can parallelize the final branch mis-predict penality with the loading of the source string. Some CPUs will also tend to have better built-in hardware support for counted memory moves than load-compare-store. (This is a minor, but non-zero gain.) 2. biseq versus strcmp. If the strings are unequal in length, bsiseq will return in O(1) time. If the strings are aliased, or have aliased data buffers, biseq will return in O(1) time. strcmp will always be O(k), where k is the length of the common prefix or the whole string if they are identical. 3. ->slen versus strlen. ->slen is obviously always O(1), while strlen is always O(n) where n is the length of the string. 4. bconcat versus strcat. Both rely on precomputing the length of the destination string argument, which will favor the bstring library. On iterated concatenations the performance difference can be enormous. 5. bsreadln versus fgets. The bsreadln function reads large blocks at a time from the given stream, then parses out lines from the buffers directly. Some C libraries will implement fgets as a loop over single fgetc calls. Testing indicates that the bsreadln approach can be several times faster for fast stream devices (such as a file that has been entirely cached.) 6. bsplits/bsplitscb versus strspn. Accelerators for the set of match characters are generated only once. 7. binstr versus strstr. The binstr implementation unrolls the loops to help reduce loop overhead. This will matter if the target string is long and source string is not found very early in the target string. With strstr, while it is possible to unroll the source contents, it is not possible to do so with the destination contents in a way that is effective because every destination character must be tested against '\0' before proceeding to the next character. 8. bReverse versus strrev. The C function must find the end of the string first before swaping character pairs. 9. bstrrchr versus no comparable C function. Its not hard to write some C code to search for a character from the end going backwards. But there is no way to do this without computing the length of the string with strlen. Practical testing indicates that in general Bstrlib is never signifcantly slower than the C library for common operations, while very often having a performance advantage that ranges from significant to massive. Even for functions like b(n)inchr versus str(c)spn() (where, in theory, there is no advantage for the Bstrlib architecture) the performance of Bstrlib is vastly superior to most tested C library implementations. Some of Bstrlib's extra functionality also lead to inevitable performance advantages over typical C solutions. For example, using the blk2tbstr macro, one can (in O(1) time) generate an internal substring by reference while not disturbing the original string. If disturbing the original string is not an option, typically, a comparable char * solution would have to make a copy of the substring to provide similar functionality. Another example is reverse character set scanning -- the str(c)spn functions only scan in a forward direction which can complicate some parsing algorithms. Where high performance char * based algorithms are available, Bstrlib can still leverage them by accessing the ->data field on bstrings. So realistically Bstrlib can never be significantly slower than any standard '\0' terminated char * based solutions. Performance of the C++ interface: ................................. The C++ interface has been designed with an emphasis on abstraction and safety first. However, since it is substantially a wrapper for the C bstring functions, for longer strings the performance comments described in the "Performance of the C interface" section above still apply. Note that the (CBString *) type can be directly cast to a (bstring) type, and passed as parameters to the C functions (though a CBString must never be passed to bdestroy.) Probably the most controversial choice is performing full bounds checking on the [] operator. This decision was made because 1) the fast alternative of not bounds checking is still available by first casting the CBString to a (const char *) buffer or to a (struct tagbstring) then derefencing .data and 2) because the lack of bounds checking is seen as one of the main weaknesses of C/C++ versus other languages. This check being done on every access leads to individual character extraction being actually slower than other languages in this one respect (other language's compilers will normally dedicate more resources on hoisting or removing bounds checking as necessary) but otherwise bring C++ up to the level of other languages in terms of functionality. It is common for other C++ libraries to leverage the abstractions provided by C++ to use reference counting and "copy on write" policies. While these techniques can speed up some scenarios, they impose a problem with respect to thread safety. bstrings and CBStrings can be properly protected with "per-object" mutexes, meaning that two bstrlib calls can be made and execute simultaneously, so long as the bstrings and CBstrings are distinct. With a reference count and alias before copy on write policy, global mutexes are required that prevent multiple calls to the strings library to execute simultaneously regardless of whether or not the strings represent the same string. One interesting trade off in CBString is that the default constructor is not trivial. I.e., it always prepares a ready to use memory buffer. The purpose is to ensure that there is a uniform internal composition for any functioning CBString that is compatible with bstrings. It also means that the other methods in the class are not forced to perform "late initialization" checks. In the end it means that construction of CBStrings are slower than other comparable C++ string classes. Initial testing, however, indicates that CBString outperforms std::string and MFC's CString, for example, in all other operations. So to work around this weakness it is recommended that CBString declarations be pushed outside of inner loops. Practical testing indicates that with the exception of the caveats given above (constructors and safe index character manipulations) the C++ API for Bstrlib generally outperforms popular standard C++ string classes. Amongst the standard libraries and compilers, the quality of concatenation operations varies wildly and very little care has gone into search functions. Bstrlib dominates those performance benchmarks. Memory management: .................. The bstring functions which write and modify bstrings will automatically reallocate the backing memory for the char buffer whenever it is required to grow. The algorithm for resizing chosen is to snap up to sizes that are a power of two which are sufficient to hold the intended new size. Memory reallocation is not performed when the required size of the buffer is decreased. This behavior can be relied on, and is necessary to make the behaviour of balloc deterministic. This trades off additional memory usage for decreasing the frequency for required reallocations: 1. For any bstring whose size never exceeds n, its buffer is not ever reallocated more than log_2(n) times for its lifetime. 2. For any bstring whose size never exceeds n, its buffer is never more than 2*(n+1) in length. (The extra characters beyond 2*n are to allow for the implicit '\0' which is always added by the bstring modifying functions.) Decreasing the buffer size when the string decreases in size would violate 1) above and in real world case lead to pathological heap thrashing. Similarly, allocating more tightly than "least power of 2 greater than necessary" would lead to a violation of 1) and have the same potential for heap thrashing. Property 2) needs emphasizing. Although the memory allocated is always a power of 2, for a bstring that grows linearly in size, its buffer memory also grows linearly, not exponentially. The reason is that the amount of extra space increases with each reallocation, which decreases the frequency of future reallocations. Obviously, given that bstring writing functions may reallocate the data buffer backing the target bstring, one should not attempt to cache the data buffer address and use it after such bstring functions have been called. This includes making reference struct tagbstrings which alias to a writable bstring. balloc or bfromcstralloc can be used to preallocate the minimum amount of space used for a given bstring. This will reduce even further the number of times the data portion is reallocated. If the length of the string is never more than one less than the memory length then there will be no further reallocations. Note that invoking the bwriteallow macro may increase the number of reallocs by one more than necessary for every call to bwriteallow interleaved with any bstring API which writes to this bstring. The library does not use any mechanism for automatic clean up for the C API. Thus explicit clean up via calls to bdestroy() are required to avoid memory leaks. Constant and static tagbstrings: ................................ A struct tagbstring can be write protected from any bstrlib function using the bwriteprotect macro. A write protected struct tagbstring can then be reset to being writable via the bwriteallow macro. There is, of course, no protection from attempts to directly access the bstring members. Modifying a bstring which is write protected by direct access has undefined behavior. static struct tagbstrings can be declared via the bsStatic macro. They are considered permanently unwritable. Such struct tagbstrings's are declared such that attempts to write to it are not well defined. Invoking either bwriteallow or bwriteprotect on static struct tagbstrings has no effect. struct tagbstring's initialized via btfromcstr or blk2tbstr are protected by default but can be made writeable via the bwriteallow macro. If bwriteallow is called on such struct tagbstring's, it is the programmer's responsibility to ensure that: 1) the buffer supplied was allocated from the heap. 2) bdestroy is not called on this tagbstring (unless the header itself has also been allocated from the heap.) 3) free is called on the buffer to reclaim its memory. bwriteallow and bwriteprotect can be invoked on ordinary bstrings (they have to be dereferenced with the (*) operator to get the levels of indirection correct) to give them write protection. Buffer declaration: ................... The memory buffer is actually declared "unsigned char *" instead of "char *". The reason for this is to trigger compiler warnings whenever uncasted char buffers are assigned to the data portion of a bstring. This will draw more diligent programmers into taking a second look at the code where they have carelessly left off the typically required cast. (Research from AT&T/Lucent indicates that additional programmer eyeballs is one of the most effective mechanisms at ferreting out bugs.) Function pointers: .................. The bgets, bread and bStream functions use function pointers to obtain strings from data streams. The function pointer declarations have been specifically chosen to be compatible with the fgetc and fread functions. While this may seem to be a convoluted way of implementing fgets and fread style functionality, it has been specifically designed this way to ensure that there is no dependency on a single narrowly defined set of device interfaces, such as just stream I/O. In the embedded world, its quite possible to have environments where such interfaces may not exist in the standard C library form. Furthermore, the generalization that this opens up allows for more sophisticated uses for these functions (performing an fgets like function on a socket, for example.) By using function pointers, it also allows such abstract stream interfaces to be created using the bstring library itself while not creating a circular dependency. Use of int's for sizes: ....................... This is just a recognition that 16bit platforms with requirements for strings that are larger than 64K and 32bit+ platforms with requirements for strings that are larger than 4GB are pretty marginal. The main focus is for 32bit platforms, and emerging 64bit platforms with reasonable < 4GB string requirements. Using ints allows for negative values which has meaning internally to bstrlib. Semantic consideration: ....................... Certain care needs to be taken when copying and aliasing bstrings. A bstring is essentially a pointer type which points to a multipart abstract data structure. Thus usage, and lifetime of bstrings have semantics that follow these considerations. For example: bstring a, b; struct tagbstring t; a = bfromcstr("Hello"); /* Create new bstring and copy "Hello" into it. */ b = a; /* Alias b to the contents of a. */ t = *a; /* Create a current instance pseudo-alias of a. */ bconcat (a, b); /* Double a and b, t is now undefined. */ bdestroy (a); /* Destroy the contents of both a and b. */ Variables of type bstring are really just references that point to real bstring objects. The equal operator (=) creates aliases, and the asterisk dereference operator (*) creates a kind of alias to the current instance (which is generally not useful for any purpose.) Using bstrcpy() is the correct way of creating duplicate instances. The ampersand operator (&) is useful for creating aliases to struct tagbstrings (remembering that constructed struct tagbstrings are not writable by default.) CBStrings use complete copy semantics for the equal operator (=), and thus do not have these sorts of issues. Debugging: .......... Bstrings have a simple, exposed definition and construction, and the library itself is open source. So most debugging is going to be fairly straight- forward. But the memory for bstrings come from the heap, which can often be corrupted indirectly, and it might not be obvious what has happened even from direct examination of the contents in a debugger or a core dump. There are some tools such as Purify, Insure++ and Electric Fence which can help solve such problems, however another common approach is to directly instrument the calls to malloc, realloc, calloc, free, memcpy, memmove and/or other calls by overriding them with macro definitions. Although the user could hack on the Bstrlib sources directly as necessary to perform such an instrumentation, Bstrlib comes with a built-in mechanism for doing this. By defining the macro BSTRLIB_MEMORY_DEBUG and providing an include file named memdbg.h this will force the core Bstrlib modules to attempt to include this file. In such a file, macros could be defined which overrides Bstrlib's useage of the C standard library. Rather than calling malloc, realloc, free, memcpy or memmove directly, Bstrlib emits the macros bstr__alloc, bstr__realloc, bstr__free, bstr__memcpy and bstr__memmove in their place respectively. By default these macros are simply assigned to be equivalent to their corresponding C standard library function call. However, if they are given earlier macro definitions (via the back door include file) they will not be given their default definition. In this way Bstrlib's interface to the standard library can be changed but without having to directly redefine or link standard library symbols (both of which are not strictly ANSI C compliant.) An example definition might include: #define bstr__alloc(sz) X_malloc ((sz), __LINE__, __FILE__) which might help contextualize heap entries in a debugging environment. The NULL parameter and sanity checking of bstrings is part of the Bstrlib API, and thus Bstrlib itself does not present any different modes which would correspond to "Debug" or "Release" modes. Bstrlib always contains mechanisms which one might think of as debugging features, but retains the performance and small memory footprint one would normally associate with release mode code. Integration Microsoft's Visual Studio debugger: ............................................... Microsoft's Visual Studio debugger has a capability of customizable mouse float over data type descriptions. This is accomplished by editting the AUTOEXP.DAT file to include the following: ; new for CBString tagbstring =slen= mlen= Bstrlib::CBStringList =count= In Visual C++ 6.0 this file is located in the directory: C:\Program Files\Microsoft Visual Studio\Common\MSDev98\Bin and in Visual Studio .NET 2003 its located here: C:\Program Files\Microsoft Visual Studio .NET 2003\Common7\Packages\Debugger This will improve the ability of debugging with Bstrlib under Visual Studio. Security -------- Bstrlib does not come with explicit security features outside of its fairly comprehensive error detection, coupled with its strict semantic support. That is to say that certain common security problems, such as buffer overrun, constant overwrite, arbitrary truncation etc, are far less likely to happen inadvertently. Where it does help, Bstrlib maximizes its advantage by providing developers a simple adoption path that lets them leave less secure string mechanisms behind. The library will not leave developers wanting, so they will be less likely to add new code using a less secure string library to add functionality that might be missing from Bstrlib. That said there are a number of security ideas not addressed by Bstrlib: 1. Race condition exploitation (i.e., verifying a string's contents, then raising the privilege level and execute it as a shell command as two non-atomic steps) is well beyond the scope of what Bstrlib can provide. It should be noted that MFC's built-in string mutex actually does not solve this problem either -- it just removes immediate data corruption as a possible outcome of such exploit attempts (it can be argued that this is worse, since it will leave no trace of the exploitation). In general race conditions have to be dealt with by careful design and implementation; it cannot be assisted by a string library. 2. Any kind of access control or security attributes to prevent usage in dangerous interfaces such as system(). Perl includes a "trust" attribute which can be endowed upon strings that are intended to be passed to such dangerous interfaces. However, Perl's solution reflects its own limitations -- notably that it is not a strongly typed language. In the example code for Bstrlib, there is a module called taint.cpp. It demonstrates how to write a simple wrapper class for managing "untainted" or trusted strings using the type system to prevent questionable mixing of ordinary untrusted strings with untainted ones then passing them to dangerous interfaces. In this way the security correctness of the code reduces to auditing the direct usages of dangerous interfaces or promotions of tainted strings to untainted ones. 3. Encryption of string contents is way beyond the scope of Bstrlib. Maintaining encrypted string contents in the futile hopes of thwarting things like using system-level debuggers to examine sensitive string data is likely to be a wasted effort (imagine a debugger that runs at a higher level than a virtual processor where the application runs). For more standard encryption usages, since the bstring contents are simply binary blocks of data, this should pose no problem for usage with other standard encryption libraries. Compatibility ------------- The Better String Library is known to compile and function correctly with the following compilers: - Microsoft Visual C++ - Watcom C/C++ - Intel's C/C++ compiler (Windows) - The GNU C/C++ compiler (cygwin and Linux on PPC64) - Borland C - Turbo C Setting of configuration options should be unnecessary for these compilers (unless exceptions are being disabled or STLport has been added to WATCOM C/C++). Bstrlib has been developed with an emphasis on portability. As such porting it to other compilers should be straight forward. This package includes a porting guide (called porting.txt) which explains what issues may exist for porting Bstrlib to different compilers and environments. ANSI issues ----------- 1. The function pointer types bNgetc and bNread have prototypes which are very similar to, but not exactly the same as fgetc and fread respectively. Basically the FILE * parameter is replaced by void *. The purpose of this was to allow one to create other functions with fgetc and fread like semantics without being tied to ANSI C's file streaming mechanism. I.e., one could very easily adapt it to sockets, or simply reading a block of memory, or procedurally generated strings (for fractal generation, for example.) The problem is that invoking the functions (bNgetc)fgetc and (bNread)fread is not technically legal in ANSI C. The reason being that the compiler is only able to coerce the function pointers themselves into the target type, however are unable to perform any cast (implicit or otherwise) on the parameters passed once invoked. I.e., if internally void * and FILE * need some kind of mechanical coercion, the compiler will not properly perform this conversion and thus lead to undefined behavior. Apparently a platform from Data General called "Eclipse" and another from Tandem called "NonStop" have a different representation for pointers to bytes and pointers to words, for example, where coercion via casting is necessary. (Actual confirmation of the existence of such machines is hard to come by, so it is prudent to be skeptical about this information.) However, this is not an issue for any known contemporary platforms. One may conclude that such platforms are effectively apocryphal even if they do exist. To correctly work around this problem to the satisfaction of the ANSI limitations, one needs to create wrapper functions for fgets and/or fread with the prototypes of bNgetc and/or bNread respectively which performs no other action other than to explicitely cast the void * parameter to a FILE *, and simply pass the remaining parameters straight to the function pointer call. The wrappers themselves are trivial: size_t freadWrap (void * buff, size_t esz, size_t eqty, void * parm) { return fread (buff, esz, eqty, (FILE *) parm); } int fgetcWrap (void * parm) { return fgetc ((FILE *) parm); } These have not been supplied in bstrlib or bstraux to prevent unnecessary linking with file I/O functions. 2. vsnprintf is not available on all compilers. Because of this, the bformat and bformata functions (and format and formata methods) are not guaranteed to work properly. For those compilers that don't have vsnprintf, the BSTRLIB_NOVSNP macro should be set before compiling bstrlib, and the format functions/method will be disabled. The more recent ANSI C standards have specified the required inclusion of a vsnprintf function. 3. The bstrlib function names are not unique in the first 6 characters. This is only an issue for older C compiler environments which do not store more than 6 characters for function names. 4. The bsafe module defines macros and function names which are part of the C library. This simply overrides the definition as expected on all platforms tested, however it is not sanctioned by the ANSI standard. This module is clearly optional and should be omitted on platforms which disallow its undefined semantics. In practice the real issue is that some compilers in some modes of operation can/will inline these standard library functions on a module by module basis as they appear in each. The linker will thus have no opportunity to override the implementation of these functions for those cases. This can lead to inconsistent behaviour of the bsafe module on different platforms and compilers. =============================================================================== Comparison with Microsoft's CString class ----------------------------------------- Although developed independently, CBStrings have very similar functionality to Microsoft's CString class. However, the bstring library has significant advantages over CString: 1. Bstrlib is a C-library as well as a C++ library (using the C++ wrapper). - Thus it is compatible with more programming environments and available to a wider population of programmers. 2. The internal structure of a bstring is considered exposed. - A single contiguous block of data can be cut into read-only pieces by simply creating headers, without allocating additional memory to create reference copies of each of these sub-strings. - In this way, using bstrings in a totally abstracted way becomes a choice rather than an imposition. Further this choice can be made differently at different layers of applications that use it. 3. Static declaration support precludes the need for constructor invocation. - Allows for static declarations of constant strings that has no additional constructor overhead. 4. Bstrlib is not attached to another library. - Bstrlib is designed to be easily plugged into any other library collection, without dependencies on other libraries or paradigms (such as "MFC".) The bstring library also comes with a few additional functions that are not available in the CString class: - bsetstr - bsplit - bread - breplace (this is different from CString::Replace()) - Writable indexed characters (for example a[i]='x') Interestingly, although Microsoft did implement mid$(), left$() and right$() functional analogues (these are functions from GWBASIC) they seem to have forgotten that mid$() could be also used to write into the middle of a string. This functionality exists in Bstrlib with the bsetstr() and breplace() functions. Among the disadvantages of Bstrlib is that there is no special support for localization or wide characters. Such things are considered beyond the scope of what bstrings are trying to deliver. CString essentially supports the older UCS-2 version of Unicode via widechar_t as an application-wide compile time switch. CString's also use built-in mechanisms for ensuring thread safety under all situations. While this makes writing thread safe code that much easier, this built-in safety feature has a price -- the inner loops of each CString method runs in its own critical section (grabbing and releasing a light weight mutex on every operation.) The usual way to decrease the impact of a critical section performance penalty is to amortize more operations per critical section. But since the implementation of CStrings is fixed as a one critical section per-operation cost, there is no way to leverage this common performance enhancing idea. The search facilities in Bstrlib are comparable to those in MFC's CString class, though it is missing locale specific collation. But because Bstrlib is interoperable with C's char buffers, it will allow programmers to write their own string searching mechanism (such as Boyer-Moore), or be able to choose from a variety of available existing string searching libraries (such as those for regular expressions) without difficulty. Microsoft used a very non-ANSI conforming trick in its implementation to allow printf() to use the "%s" specifier to output a CString correctly. This can be convenient, but it is inherently not portable. CBString requires an explicit cast, while bstring requires the data member to be dereferenced. Microsoft's own documentation recommends casting, instead of relying on this feature. Comparison with C++'s std::string --------------------------------- This is the C++ language's standard STL based string class. 1. There is no C implementation. 2. The [] operator is not bounds checked. 3. Missing a lot of useful functions like printf-like formatting. 4. Some sub-standard std::string implementations (SGI) are necessarily unsafe to use with multithreading. 5. Limited by STL's std::iostream which in turn is limited by ifstream which can only take input from files. (Compare to CBStream's API which can take abstracted input.) 6. Extremely uneven performance across implementations. Comparison with ISO C TR 24731 proposal --------------------------------------- Following the ISO C99 standard, Microsoft has proposed a group of C library extensions which are supposedly "safer and more secure". This proposal is expected to be adopted by the ISO C standard which follows C99. The proposal reveals itself to be very similar to Microsoft's "StrSafe" library. The functions are basically the same as other standard C library string functions except that destination parameters are paired with an additional length parameter of type rsize_t. rsize_t is the same as size_t, however, the range is checked to make sure its between 1 and RSIZE_MAX. Like Bstrlib, the functions perform a "parameter check". Unlike Bstrlib, when a parameter check fails, rather than simply outputing accumulatable error statuses, they call a user settable global error function handler, and upon return of control performs no (additional) detrimental action. The proposal covers basic string functions as well as a few non-reenterable functions (asctime, ctime, and strtok). 1. Still based solely on char * buffers (and therefore strlen() and strcat() is still O(n), and there are no faster streq() comparison functions.) 2. No growable string semantics. 3. Requires manual buffer length synchronization in the source code. 4. No attempt to enhance functionality of the C library. 5. Introduces a new error scenario (strings exceeding RSIZE_MAX length). The hope is that by exposing the buffer length requirements there will be fewer buffer overrun errors. However, the error modes are really just transformed, rather than removed. The real problem of buffer overflows is that they all happen as a result of erroneous programming. So forcing programmers to manually deal with buffer limits, will make them more aware of the problem but doesn't remove the possibility of erroneous programming. So a programmer that erroneously mixes up the rsize_t parameters is no better off from a programmer that introduces potential buffer overflows through other more typical lapses. So at best this may reduce the rate of erroneous programming, rather than making any attempt at removing failure modes. The error handler can discriminate between types of failures, but does not take into account any callsite context. So the problem is that the error is going to be manifest in a piece of code, but there is no pointer to that code. It would seem that passing in the call site __FILE__, __LINE__ as parameters would be very useful, but the API clearly doesn't support such a thing (it would increase code bloat even more than the extra length parameter does, and would require macro tricks to implement). The Bstrlib C API takes the position that error handling needs to be done at the callsite, and just tries to make it as painless as possible. Furthermore, error modes are removed by supporting auto-growing strings and aliasing. For capturing errors in more central code fragments, Bstrlib's C++ API uses exception handling extensively, which is superior to the leaf-only error handler approach. Comparison with Managed String Library CERT proposal ---------------------------------------------------- The main webpage for the managed string library: http://www.cert.org/secure-coding/managedstring.html Robert Seacord at CERT has proposed a C string library that he calls the "Managed String Library" for C. Like Bstrlib, it introduces a new type which is called a managed string. The structure of a managed string (string_m) is like a struct tagbstring but missing the length field. This internal structure is considered opaque. The length is, like the C standard library, always computed on the fly by searching for a terminating NUL on every operation that requires it. So it suffers from every performance problem that the C standard library suffers from. Interoperating with C string APIs (like printf, fopen, or anything else that takes a string parameter) requires copying to additionally allocating buffers that have to be manually freed -- this makes this library probably slower and more cumbersome than any other string library in existence. The library gives a fully populated error status as the return value of every string function. The hope is to be able to diagnose all problems specifically from the return code alone. Comparing this to Bstrlib, which aways returns one consistent error message, might make it seem that Bstrlib would be harder to debug; but this is not true. With Bstrlib, if an error occurs there is always enough information from just knowing there was an error and examining the parameters to deduce exactly what kind of error has happened. The managed string library thus gives up nested function calls while achieving little benefit, while Bstrlib does not. One interesting feature that "managed strings" has is the idea of data sanitization via character set whitelisting. That is to say, a globally definable filter that makes any attempt to put invalid characters into strings lead to an error and not modify the string. The author gives the following example: // create valid char set if (retValue = strcreate_m(&str1, "abc") ) { fprintf( stderr, "Error %d from strcreate_m.\n", retValue ); } if (retValue = setcharset(str1)) { fprintf( stderr, "Error %d from setcharset().\n", retValue ); } if (retValue = strcreate_m(&str1, "aabbccabc")) { fprintf( stderr, "Error %d from strcreate_m.\n", retValue ); } // create string with invalid char set if (retValue = strcreate_m(&str1, "abbccdabc")) { fprintf( stderr, "Error %d from strcreate_m.\n", retValue ); } Which we can compare with a more Bstrlib way of doing things: bstring bCreateWithFilter (const char * cstr, const_bstring filter) { bstring b = bfromcstr (cstr); if (BSTR_ERR != bninchr (b, filter) && NULL != b) { fprintf (stderr, "Filter violation.\n"); bdestroy (b); b = NULL; } return b; } struct tagbstring charFilter = bsStatic ("abc"); bstring str1 = bCreateWithFilter ("aabbccabc", &charFilter); bstring str2 = bCreateWithFilter ("aabbccdabc", &charFilter); The first thing we should notice is that with the Bstrlib approach you can have different filters for different strings if necessary. Furthermore, selecting a charset filter in the Managed String Library is uni-contextual. That is to say, there can only be one such filter active for the entire program, which means its usage is not well defined for intermediate library usage (a library that uses it will interfere with user code that uses it, and vice versa.) It is also likely to be poorly defined in multi-threading environments. There is also a question as to whether the data sanitization filter is checked on every operation, or just on creation operations. Since the charset can be set arbitrarily at run time, it might be set *after* some managed strings have been created. This would seem to imply that all functions should run this additional check every time if there is an attempt to enforce this. This would make things tremendously slow. On the other hand, if it is assumed that only creates and other operations that take char *'s as input need be checked because the charset was only supposed to be called once at and before any other managed string was created, then one can see that its easy to cover Bstrlib with equivalent functionality via a few wrapper calls such as the example given above. And finally we have to question the value of sanitation in the first place. For example, for httpd servers, there is generally a requirement that the URLs parsed have some form that avoids undesirable translation to local file system filenames or resources. The problem is that the way URLs can be encoded, it must be completely parsed and translated to know if it is using certain invalid character combinations. That is to say, merely filtering each character one at a time is not necessarily the right way to ensure that a string has safe contents. In the article that describes this proposal, it is claimed that it fairly closely approximates the existing C API semantics. On this point we should compare this "closeness" with Bstrlib: Bstrlib Managed String Library ------- ---------------------- Pointer arithmetic Segment arithmetic N/A Use in C Std lib ->data, or bdata{e} getstr_m(x,*) ... free(x) String literals bsStatic, bsStaticBlk strcreate_m() Transparency Complete None Its pretty clear that the semantic mapping from C strings to Bstrlib is fairly straightforward, and that in general semantic capabilities are the same or superior in Bstrlib. On the other hand the Managed String Library is either missing semantics or changes things fairly significantly. Comparison with Annexia's c2lib library --------------------------------------- This library is available at: http://www.annexia.org/freeware/c2lib 1. Still based solely on char * buffers (and therefore strlen() and strcat() is still O(n), and there are no faster streq() comparison functions.) Their suggestion that alternatives which wrap the string data type (such as bstring does) imposes a difficulty in interoperating with the C langauge's ordinary C string library is not founded. 2. Introduction of memory (and vector?) abstractions imposes a learning curve, and some kind of memory usage policy that is outside of the strings themselves (and therefore must be maintained by the developer.) 3. The API is massive, and filled with all sorts of trivial (pjoin) and controvertial (pmatch -- regular expression are not sufficiently standardized, and there is a very large difference in performance between compiled and non-compiled, REs) functions. Bstrlib takes a decidely minimal approach -- none of the functionality in c2lib is difficult or challenging to implement on top of Bstrlib (except the regex stuff, which is going to be difficult, and controvertial no matter what.) 4. Understanding why c2lib is the way it is pretty much requires a working knowledge of Perl. bstrlib requires only knowledge of the C string library while providing just a very select few worthwhile extras. 5. It is attached to a lot of cruft like a matrix math library (that doesn't include any functions for getting the determinant, eigenvectors, eigenvalues, the matrix inverse, test for singularity, test for orthogonality, a grahm schmit orthogonlization, LU decomposition ... I mean why bother?) Convincing a development house to use c2lib is likely quite difficult. It introduces too much, while not being part of any kind of standards body. The code must therefore be trusted, or maintained by those that use it. While bstring offers nothing more on this front, since its so much smaller, covers far less in terms of scope, and will typically improve string performance, the barrier to usage should be much smaller. Comparison with stralloc/qmail ------------------------------ More information about this library can be found here: http://www.canonical.org/~kragen/stralloc.html or here: http://cr.yp.to/lib/stralloc.html 1. Library is very very minimal. A little too minimal. 2. Untargetted source parameters are not declared const. 3. Slightly different expected emphasis (like _cats function which takes an ordinary C string char buffer as a parameter.) Its clear that the remainder of the C string library is still required to perform more useful string operations. The struct declaration for their string header is essentially the same as that for bstring. But its clear that this was a quickly written hack whose goals are clearly a subset of what Bstrlib supplies. For anyone who is served by stralloc, Bstrlib is complete substitute that just adds more functionality. stralloc actually uses the interesting policy that a NULL data pointer indicates an empty string. In this way, non-static empty strings can be declared without construction. This advantage is minimal, since static empty bstrings can be declared inline without construction, and if the string needs to be written to it should be constructed from an empty string (or its first initializer) in any event. wxString class -------------- This is the string class used in the wxWindows project. A description of wxString can be found here: http://www.wxwindows.org/manuals/2.4.2/wx368.htm#wxstring This C++ library is similar to CBString. However, it is littered with trivial functions (IsAscii, UpperCase, RemoveLast etc.) 1. There is no C implementation. 2. The memory management strategy is to allocate a bounded fixed amount of additional space on each resize, meaning that it does not have the log_2(n) property that Bstrlib has (it will thrash very easily, cause massive fragmentation in common heap implementations, and can easily be a common source of performance problems). 3. The library uses a "copy on write" strategy, meaning that it has to deal with multithreading problems. Vstr ---- This is a highly orthogonal C string library with an emphasis on networking/realtime programming. It can be found here: http://www.and.org/vstr/ 1. The convoluted internal structure does not contain a '\0' char * compatible buffer, so interoperability with the C library a non-starter. 2. The API and implementation is very large (owing to its orthogonality) and can lead to difficulty in understanding its exact functionality. 3. An obvious dependency on gnu tools (confusing make configure step) 4. Uses a reference counting system, meaning that it is not likely to be thread safe. The implementation has an extreme emphasis on performance for nontrivial actions (adds, inserts and deletes are all constant or roughly O(#operations) time) following the "zero copy" principle. This trades off performance of trivial functions (character access, char buffer access/coersion, alias detection) which becomes significantly slower, as well as incremental accumulative costs for its searching/parsing functions. Whether or not Vstr wins any particular performance benchmark will depend a lot on the benchmark, but it should handily win on some, while losing dreadfully on others. The learning curve for Vstr is very steep, and it doesn't come with any obvious way to build for Windows or other platforms without gnu tools. At least one mechanism (the iterator) introduces a new undefined scenario (writing to a Vstr while iterating through it.) Vstr has a very large footprint, and is very ambitious in its total functionality. Vstr has no C++ API. Vstr usage requires context initialization via vstr_init() which must be run in a thread-local context. Given the totally reference based architecture this means that sharing Vstrings across threads is not well defined, or at least not safe from race conditions. This API is clearly geared to the older standard of fork() style multitasking in UNIX, and is not safely transportable to modern shared memory multithreading available in Linux and Windows. There is no portable external solution making the library thread safe (since it requires a mutex around each Vstr context -- not each string.) In the documentation for this library, a big deal is made of its self hosted s(n)printf-like function. This is an issue for older compilers that don't include vsnprintf(), but also an issue because Vstr has a slow conversion to '\0' terminated char * mechanism. That is to say, using "%s" to format data that originates from Vstr would be slow without some sort of native function to do so. Bstrlib sidesteps the issue by relying on what snprintf-like functionality does exist and having a high performance conversion to a char * compatible string so that "%s" can be used directly. Str Library ----------- This is a fairly extensive string library, that includes full unicode support and targetted at the goal of out performing MFC and STL. The architecture, similarly to MFC's CStrings, is a copy on write reference counting mechanism. http://www.utilitycode.com/str/default.aspx 1. Commercial. 2. C++ only. This library, like Vstr, uses a ref counting system. There is only so deeply I can analyze it, since I don't have a license for it. However, performance improvements over MFC's and STL, doesn't seem like a sufficient reason to move your source base to it. For example, in the future, Microsoft may improve the performance CString. It should be pointed out that performance testing of Bstrlib has indicated that its relative performance advantage versus MFC's CString and STL's std::string is at least as high as that for the Str library. libmib astrings --------------- A handful of functional extensions to the C library that add dynamic string functionality. http://www.mibsoftware.com/libmib/astring/ This package basically references strings through char ** pointers and assumes they are pointing to the top of an allocated heap entry (or NULL, in which case memory will be newly allocated from the heap.) So its still up to user to mix and match the older C string functions with these functions whenever pointer arithmetic is used (i.e., there is no leveraging of the type system to assert semantic differences between references and base strings as Bstrlib does since no new types are introduced.) Unlike Bstrlib, exact string length meta data is not stored, thus requiring a strlen() call on *every* string writing operation. The library is very small, covering only a handful of C's functions. While this is better than nothing, it is clearly slower than even the standard C library, less safe and less functional than Bstrlib. To explain the advantage of using libmib, their website shows an example of how dangerous C code: char buf[256]; char *pszExtraPath = ";/usr/local/bin"; strcpy(buf,getenv("PATH")); /* oops! could overrun! */ strcat(buf,pszExtraPath); /* Could overrun as well! */ printf("Checking...%s\n",buf); /* Some printfs overrun too! */ is avoided using libmib: char *pasz = 0; /* Must initialize to 0 */ char *paszOut = 0; char *pszExtraPath = ";/usr/local/bin"; if (!astrcpy(&pasz,getenv("PATH"))) /* malloc error */ exit(-1); if (!astrcat(&pasz,pszExtraPath)) /* malloc error */ exit(-1); /* Finally, a "limitless" printf! we can use */ asprintf(&paszOut,"Checking...%s\n",pasz);fputs(paszOut,stdout); astrfree(&pasz); /* Can use free(pasz) also. */ astrfree(&paszOut); However, compare this to Bstrlib: bstring b, out; bcatcstr (b = bfromcstr (getenv ("PATH")), ";/usr/local/bin"); out = bformat ("Checking...%s\n", bdatae (b, "")); /* if (out && b) */ fputs (bdatae (out, ""), stdout); bdestroy (b); bdestroy (out); Besides being shorter, we can see that error handling can be deferred right to the very end. Also, unlike the above two versions, if getenv() returns with NULL, the Bstrlib version will not exhibit undefined behavior. Initialization starts with the relevant content rather than an extra autoinitialization step. libclc ------ An attempt to add to the standard C library with a number of common useful functions, including additional string functions. http://libclc.sourceforge.net/ 1. Uses standard char * buffer, and adopts C 99's usage of "restrict" to pass the responsibility to guard against aliasing to the programmer. 2. Adds no safety or memory management whatsoever. 3. Most of the supplied string functions are completely trivial. The goals of libclc and Bstrlib are clearly quite different. fireString ---------- http://firestuff.org/ 1. Uses standard char * buffer, and adopts C 99's usage of "restrict" to pass the responsibility to guard against aliasing to the programmer. 2. Mixes char * and length wrapped buffers (estr) functions, doubling the API size, with safety limited to only half of the functions. Firestring was originally just a wrapper of char * functionality with extra length parameters. However, it has been augmented with the inclusion of the estr type which has similar functionality to stralloc. But firestring does not nearly cover the functional scope of Bstrlib. Safe C String Library --------------------- A library written for the purpose of increasing safety and power to C's string handling capabilities. http://www.zork.org/safestr/safestr.html 1. While the safestr_* functions are safe in of themselves, interoperating with char * string has dangerous unsafe modes of operation. 2. The architecture of safestr's causes the base pointer to change. Thus, its not practical/safe to store a safestr in multiple locations if any single instance can be manipulated. 3. Dependent on an additional error handling library. 4. Uses reference counting, meaning that it is either not thread safe or slow and not portable. I think the idea of reallocating (and hence potentially changing) the base pointer is a serious design flaw that is fatal to this architecture. True safety is obtained by having automatic handling of all common scenarios without creating implicit constraints on the user. Because of its automatic temporary clean up system, it cannot use "const" semantics on input arguments. Interesting anomolies such as: safestr_t s, t; s = safestr_replace (t = SAFESTR_TEMP ("This is a test"), SAFESTR_TEMP (" "), SAFESTR_TEMP (".")); /* t is now undefined. */ are possible. If one defines a function which takes a safestr_t as a parameter, then the function would not know whether or not the safestr_t is defined after it passes it to a safestr library function. The author recommended method for working around this problem is to examine the attributes of the safestr_t within the function which is to modify any of its parameters and play games with its reference count. I think, therefore, that the whole SAFESTR_TEMP idea is also fatally broken. The library implements immutability, optional non-resizability, and a "trust" flag. This trust flag is interesting, and suggests that applying any arbitrary sequence of safestr_* function calls on any set of trusted strings will result in a trusted string. It seems to me, however, that if one wanted to implement a trusted string semantic, one might do so by actually creating a different *type* and only implement the subset of string functions that are deemed safe (i.e., user input would be excluded, for example.) This, in essence, would allow the compiler to enforce trust propogation at compile time rather than run time. Non-resizability is also interesting, however, it seems marginal (i.e., to want a string that cannot be resized, yet can be modified and yet where a fixed sized buffer is undesirable.) Libsrt ------ This is a length based string library based on a slightly different strategy. The string contents are appended to the end of the header directly so strings only require a single allocation. However, whenever a reallocation occurs, the header is replicated and the base pointer for the string is changed. That means references to the string are only valid so long as they are not resized after any such reference is cached. The internal structure maintains a lot some state used to accelerate unicode manipulation. This makes sustainable usage of the library essentially opaque. This also creates a bottleneck for whatever extensions to the library one desires (write all extensions on top of the base library, put in a request to the author, or dedicate an expert to learn the internals of the library). The library is committed to Unicode representation of its string data, and therefore cannot be used as a generic buffer library. =============================================================================== Examples -------- Dumping a line numbered file: FILE * fp; int i, ret; struct bstrList * lines; struct tagbstring prefix = bsStatic ("-> "); if (NULL != (fp = fopen ("bstrlib.txt", "rb"))) { bstring b = bread ((bNread) fread, fp); fclose (fp); if (NULL != (lines = bsplit (b, '\n'))) { for (i=0; i < lines->qty; i++) { binsert (lines->entry[i], 0, &prefix, '?'); printf ("%04d: %s\n", i, bdatae (lines->entry[i], "NULL")); } bstrListDestroy (lines); } bdestroy (b); } For numerous other examples, see bstraux.c, bstraux.h and the example archive. =============================================================================== License ------- The Better String Library is available under either the BSD license (see the accompanying license.txt) or the Gnu Public License version 2 (see the accompanying gpl.txt) at the option of the user. =============================================================================== Acknowledgements ---------------- The following individuals have made significant contributions to the design and testing of the Better String Library: Bjorn Augestad Clint Olsen Darryl Bleau Fabian Cenedese Graham Wideman Ignacio Burgueno International Business Machines Corporation Ira Mica John Kortink Manuel Woelker Marcel van Kervinck Michael Hsieh Richard A. Smith Simon Ekstrom Wayne Scott Zed A. Shaw ===============================================================================