Skip to content

Ngrams, Qgrams, skip-grams, etc.

When string converted to set, bag or vector - information about order lost. Grouping allows to negate this effect (at least to some extent).

Ngrams, Qgrams

I’m not sure if there is a difference between Ngrams and Qgrams. I would consider them the same. Examples:

  • bigrams - groups symbols in pairs, for example, t,e,x,t => te,ex,xt
  • trigrams - groups symbols in triples, for example, t,e,x,t => tex,ext
  • four-grams, five-grams etc.

Prefix (suffix) padding

By adding prefix you can add “weight” to the start of the word. For example if you need to implement autocomplete functionality you would be more interested in words which are more similar on the begining.

Example, trigrams with sufix 2: t,e,x,t => #,#,t,e,x,t => ##t,#te,tex,ext.

Similar idea of adding more weight to the begining of word used in Jaro-Winkler similarity.

Skip-grams

By adding “skips” it alows Ngrams to capture co-occurrence (reminds me of Markov chain).

Examples:

  • 1-skip-bigrams: t,e,x,t => te,tx,ex,et,xt
  • 1-skip-trigrams: t,e,x,t => tex,txt,tet,ext

Delete-grams

In Ngrams n defines how much symbols to group. In delete-grams n defines up to how much symbols to remove from the group.

For example, 1-delete-grams - t,e,x,t => text,tex,txt,tet,ext.

If two words have delete-grams in common it means that Levenstein distance between them is less or equal to 2*n.

Example:

  • t,e,x,t => text,tex,txt,tet,ext.
  • t,e,s,t => test,tes,tst,tet,est.
  • tet is a common gram between 2 strings. Levenstein distance between them is 1.

Longest common delete-gram is also LCS (Longest Common Subsequence).

Prefixes

Example:

  • t,e,x,t => text,tex,te,t.

Suffixes

Example:

  • t,e,x,t => text,ext,xt,t.

Real life examples