Ngrams, Qgrams, skip-grams, etc.
When string converted to set, bag or vector - information about order lost. Grouping allows to negate this effect (at least to some extent).
Ngrams, Qgrams
I’m not sure if there is a difference between Ngrams and Qgrams. I would consider them the same. Examples:
- bigrams - groups symbols in pairs, for example,
t,e,x,t
=>te,ex,xt
- trigrams - groups symbols in triples, for example,
t,e,x,t
=>tex,ext
- four-grams, five-grams etc.
Prefix (suffix) padding
By adding prefix you can add “weight” to the start of the word. For example if you need to implement autocomplete functionality you would be more interested in words which are more similar on the begining.
Example, trigrams with sufix 2: t,e,x,t
=> #,#,t,e,x,t
=> ##t,#te,tex,ext
.
Similar idea of adding more weight to the begining of word used in Jaro-Winkler similarity.
Skip-grams
By adding “skips” it alows Ngrams to capture co-occurrence (reminds me of Markov chain).
Examples:
- 1-skip-bigrams:
t,e,x,t
=>te,tx,ex,et,xt
- 1-skip-trigrams:
t,e,x,t
=>tex,txt,tet,ext
Delete-grams
In Ngrams n
defines how much symbols to group. In delete-grams n
defines up to how much symbols to remove from the group.
For example, 1-delete-grams - t,e,x,t
=> text,tex,txt,tet,ext
.
If two words have delete-grams in common it means that Levenstein distance between them is less or equal to 2*n
.
Example:
t,e,x,t
=>text,tex,txt,tet,ext
.t,e,s,t
=>test,tes,tst,tet,est
.tet
is a common gram between 2 strings. Levenstein distance between them is 1.
Longest common delete-gram is also LCS (Longest Common Subsequence).
Prefixes
Example:
t,e,x,t
=>text,tex,te,t
.
Suffixes
Example:
t,e,x,t
=>text,ext,xt,t
.
Real life examples
- Delete-grams used in:
- FastSS ↗ (Fast Similarity Search)
- SymSpell ↗
- Ngrams can be used to speed up pattern matching and substring lookup:
- Prefixes can be used to speed up “autocomplete”:
- tries
- database indexes (B-Tree, B+Tree and similar)
- Suffixes can be used to speed up substring search
- suffix arrays