Ngrams, Qgrams, skip-grams, etc.

When string converted to set, bag or vector - information about order lost. Grouping allows to negate this effect (at least to some extent).

Ngrams, Qgrams

I’m not sure if there is a difference between Ngrams and Qgrams. I would consider them the same. Examples:

bigrams - groups symbols in pairs, for example, t,e,x,t => te,ex,xt
trigrams - groups symbols in triples, for example, t,e,x,t => tex,ext
four-grams, five-grams etc.

Prefix (suffix) padding

By adding prefix you can add “weight” to the start of the word. For example if you need to implement autocomplete functionality you would be more interested in words which are more similar on the begining.

Example, trigrams with sufix 2: t,e,x,t => #,#,t,e,x,t => ##t,#te,tex,ext.

Similar idea of adding more weight to the begining of word used in Jaro-Winkler similarity.

Skip-grams

By adding “skips” it alows Ngrams to capture co-occurrence (reminds me of Markov chain).

Examples:

1-skip-bigrams: t,e,x,t => te,tx,ex,et,xt
1-skip-trigrams: t,e,x,t => tex,txt,tet,ext

Delete-grams

In Ngrams n defines how much symbols to group. In delete-grams n defines up to how much symbols to remove from the group.

For example, 1-delete-grams - t,e,x,t => text,tex,txt,tet,ext.

If two words have delete-grams in common it means that Levenstein distance between them is less or equal to 2*n.

Example:

t,e,x,t => text,tex,txt,tet,ext.
t,e,s,t => test,tes,tst,tet,est.
tet is a common gram between 2 strings. Levenstein distance between them is 1.

Longest common delete-gram is also LCS (Longest Common Subsequence).

Prefixes

Example:

t,e,x,t => text,tex,te,t.

Suffixes

Example:

t,e,x,t => text,ext,xt,t.

Real life examples

Delete-grams used in:
- FastSS (Fast Similarity Search)
- SymSpell
Ngrams can be used to speed up pattern matching and substring lookup:
Prefixes can be used to speed up “autocomplete”:
- tries
- database indexes (B-Tree, B+Tree and similar)
Suffixes can be used to speed up substring search
- suffix arrays