levenshtein distance - How to calculate equal hash for similar strings?


I am building a plagiarism detector and I use the shingle method. For example, I have the following shingles:

  1. i go cinema
  2. i go cinema1
  3. i go th cinema

Is there a method of calculating an equal hash for these lines?

I know that the Levenshtein distance exists. However, I do not know what I should take as the source word. Maybe there is a better way than the Levenshtein distance.
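For reference, here is a minimal sketch of the Levenshtein distance in Python, using the standard dynamic-programming recurrence. The function name `levenshtein` is just an illustrative choice. The distance is defined between a pair of strings, so one possible reading of the "source word" problem is to compare shingles pairwise instead of against one fixed string.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


shingles = ["i go cinema", "i go cinema1", "i go th cinema"]
for s in shingles[1:]:
    print(repr(s), levenshtein(shingles[0], s))
```

For the three shingles in the question, the second and third are at distance 1 and 3 from the first.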

The problem with hashing is that, logically, you'll run into 2 strings that differ by a single character but hash to different values.

Small proof:

Consider all possible strings.
Assume that, between them, these hash to at least 2 different values.
Take 2 strings A and B that hash to different values.
You can go from A to B by changing 1 character at a time.
At some point on the way the hash will change.
At that point the hash is different for a single character change.
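The argument can be checked mechanically. The sketch below is only an illustration: `coarse_hash` is a made-up "tolerant" hash, MD5 stands in for an ordinary hash, and the path simply deletes characters down to the common prefix and then appends the rest. For any hash function that gives A and B different values, the walk finds two neighbouring strings, one character apart, whose hashes differ.

```python
import hashlib


def md5_hash(s):
    # Example hash; any non-constant hash function fits the argument.
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def coarse_hash(s):
    # Hypothetical "tolerant" hash: many similar strings share a value.
    return len(s) // 4


def single_edit_path(a, b):
    """Strings from a to b where neighbours differ by one character.

    Deletes from the end of a down to the common prefix, then appends
    the remaining characters of b one at a time."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    path = [a[:n] for n in range(len(a), k - 1, -1)]   # a ... a[:k]
    path += [b[:n] for n in range(k + 1, len(b) + 1)]  # b[:k+1] ... b
    return path


def first_hash_change(a, b, hash_fn):
    """Return the first neighbouring pair on the path whose hashes differ."""
    path = single_edit_path(a, b)
    for x, y in zip(path, path[1:]):
        if hash_fn(x) != hash_fn(y):
            return x, y
    return None  # only possible when hash_fn(a) == hash_fn(b)


a, b = "i go cinema", "i go th cinema"
for fn in (md5_hash, coarse_hash):
    print(fn.__name__, first_hash_change(a, b, fn))
```

Even the deliberately coarse hash has to split some pair of neighbours; here it does so between `"i go cin"` and `"i go ci"`.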

Some options I can think of:

  • Hash multiple parts of the string, and check each of these hashes. This won't work, since a single character omission will cause a significant difference in the hash values.

  • Check a range of hashes. A hash is 1 dimensional, but string similarity is not, so this won't work either.
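To illustrate the first option, here is a small sketch that hashes consecutive fixed-size parts of a string and counts how many part hashes two strings share. The part size of 4, the use of MD5, and the variant string with one character left out are all assumptions made for the example.

```python
import hashlib


def chunk_hashes(s, size=4):
    """Hash consecutive fixed-size parts of s (the last part may be shorter)."""
    parts = [s[i:i + size] for i in range(0, len(s), size)]
    return [hashlib.md5(p.encode("utf-8")).hexdigest() for p in parts]


def shared_chunks(a, b, size=4):
    """Number of distinct part hashes that appear in both strings."""
    return len(set(chunk_hashes(a, size)) & set(chunk_hashes(b, size)))


original = "i go cinema"
omitted = "i o cinema"  # hypothetical variant with one character ("g") left out

print(shared_chunks(original, original))
print(shared_chunks(original, omitted))
```

Leaving out one character near the start shifts every later part, so with these inputs the two strings share no part hashes at all, even though they are one edit apart.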

All in all, hashing is not the way to go.

