Skip to content

How can we represent a string?

Sequence

Most natural representation of the string is the sequence (aka list, aka array) of symbols (e.g. characters, words, sentences, tokens etc.). This is exactly how we read/write - symbol by symbol from-left-to-right or from-right-to-left or from-top-to-bottom.

How to tell if two strings are similar? Let’s start with a trivial example: two strings of the same length. Two strings of the same length are more similar the more characters they have in the same plac. For example, text vs test 3 characters out 4 are the same. So we can say that similarity is 0.75. Other way to say it is that edit distance is 1 (you need to change 1 symbol in order to transform first string to the second). This is called Hamming distance.

There are other edit distances, which also use insert, delete and transpose operations.

One way or another all similarity (dissimilarity) measures for sequences concern about order. On one side this is a clear mental model - how much characters you need to insert/substitute/delete/transpose in order to transform string. On the other side those algorithms tend to be quite expensive typical LCS, Levenstein algorithms are O(nm)O(nm) in time complexity and O(n)O(n)-O(nm)O(nm) in space complexity. And there is no way (at least known to me) to make algorithm offline e.g. precalculate some index which would speed up execution later. Note: there are modern algorithms which can calculate aproximate Levenstein distance in nearly linear time.

That is why it makes sense to use other representations as well.

Set

Set representation is very simple - take sequence of symbols and turn it into set. This transformation doesn’t preserve order or frequency. From set point of view text and ettx are the same ({t,e,x}\{t, e, x\}). So what the point of such representation then?

Set measures are quite cheap to calculate, and it is possible to do offline, for example, using inverted index. Set measures can be used to filter good candidates from a big list of strings and then use more expensive algorithms (like edit distance) on a smaller number of candidates.

Also there is a way to improve precision of set measures - before converting sequence to set we can group symbols into pairs or triples (in the same order they were in original string). This way order is preserved at least partially, but obviously it would require more space. Example, set of bigrams: text - {te,ex,xt}\{te, ex, xt\} vs ettx - {et,tt,tx}\{et, tt, tx\} (0 in common). This trick called ngrams or qgrams. Read more about it in grouping section.

Multiset

Bag

Bag is like a set, but it permits repetition of elements. So bag doesn’t preserve order (the same as set), but preserves frequency. Bag measures are more precise than set measures (because bag preserves more information). Otherwise pros and cons are the same as for set (see above).

Example, text - {{e,t,t,x}}\{\{e, t, t, x\}\} or {(e,1),(t,2),(x,1)}\{(e,1), (t,2), (x,1)\}

Vector

Conceptually vector representation is the same as bag (that is why I grouped them in multiset section). Difference is rather at implementation level. Bag probably would be implemented as hashtable (aka hash map, aka associative array). Vector probably would be implemented as array (probably elements would be aligned next to each other in memory), which allows to use SIMD (Single Instruction/Multiple Data) operations, which are much faster.

In order to represent strings as vectors you need to build vector space (aka coordinate, aka alphabet) and then put frequencies according to it. Example, lets say that vector space is: a,e,g,t,x, than vector for text is [0,1,0,2,1][0,1,0,2,1].

Vector representations often used in NLP (Natural Language Processing) tasks. There are even vector databases.

Operation correspondence

This is rough operations corespondance (similar, but not identical).

 intersectionunionsetxyxybagi=1Nmin(xi,yi)i=1Nmax(xi,yi)sequencelcs(x,y)scs(x,y)\begin{align} \text{ }& & &\text{intersection}& &\text{union}& \\ \text{set}& & &|x \cap y|& &|x \cup y|& \\ \text{bag}& & &\sum_{i=1}^{N} \min(x_i, y_i)& &\sum_{i=1}^{N} \max(x_i, y_i)& \\ \text{sequence}& & &|lcs(x,y)|& &|scs(x,y)|& \end{align}