Levenshtein distance - How to calculate an equal hash for similar strings?
I am building a plagiarism detector using the shingle method. For example, I have the following shingles:
- i go cinema
- i go cinema1
- i go th cinema
Is there a method of calculating an equal hash for these lines?
I know of the existence of Levenshtein distance. However, I do not know what I should take as the source word. Maybe there is a better way than to consider Levenshtein distance.
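For reference, here is a minimal sketch of Levenshtein distance applied to the shingles above. It compares strings pairwise, so under this reading no single "source word" has to be chosen. The function name `levenshtein` is my own; this is the textbook dynamic-programming version, not a tuned implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


shingles = ["i go cinema", "i go cinema1", "i go th cinema"]
for other in shingles[1:]:
    print(repr(other), levenshtein(shingles[0], other))
```

For these shingles the distances to the first one are 1 and 3, so a threshold on the distance (the cut-off value would be your choice) can group them without any hash.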
The problem with hashing is that, logically, you'll always run into 2 strings that differ by a single character but hash to different values.
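Small proof:
Consider all possible strings.
Assume these strings hash to at least 2 different values (otherwise the hash is constant and tells you nothing).
Take any 2 strings A and B that hash to different values.
You can go from A to B by changing only 1 character at a time (substituting, inserting or deleting it).
At some point along the way the hash must change.
At that point the hash is different for a single character change.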
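The argument can be tried out with a small sketch. Everything here is illustrative: `toy_hash` is a hypothetical stand-in (first byte of SHA-256, so 256 buckets), and `edit_path` builds one possible chain of single-character edits from A to B, not the shortest one.

```python
import hashlib


def toy_hash(s: str) -> int:
    # Hypothetical stand-in hash: first byte of SHA-256 (256 possible values).
    return hashlib.sha256(s.encode("utf-8")).digest()[0]


def edit_path(a: str, b: str) -> list:
    """Strings leading from a to b; each differs from the previous by one character edit."""
    path = [a]
    cur = a
    # Substitute mismatching characters in the overlapping part.
    for i in range(min(len(a), len(b))):
        if cur[i] != b[i]:
            cur = cur[:i] + b[i] + cur[i + 1:]
            path.append(cur)
    # Delete surplus characters from the end.
    while len(cur) > len(b):
        cur = cur[:-1]
        path.append(cur)
    # Append the characters that are still missing.
    while len(cur) < len(b):
        cur += b[len(cur)]
        path.append(cur)
    return path


def first_hash_change(a: str, b: str, h=toy_hash):
    """Return the first neighbouring pair on the path whose hashes differ, or None."""
    path = edit_path(a, b)
    for prev, nxt in zip(path, path[1:]):
        if h(prev) != h(nxt):
            return prev, nxt
    return None


print(first_hash_change("i go cinema", "i go th cinema"))
```

Whenever `h(a) != h(b)`, `first_hash_change` cannot return `None`: the path starts at one hash value and ends at another, so some neighbouring pair has to differ.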
Some options I can think of:
Hash multiple parts of the string and check each of these hashes. This probably won't work well, since a single character omission will cause a significant difference in the hash values.
Check a range of hashes. A hash is 1-dimensional, but string similarity is not, so this probably won't work either.
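To illustrate the first option and why it is fragile, here is a rough sketch that hashes fixed-size parts of a string and counts how many parts match position by position. The helper names and the chunk size of 4 are arbitrary assumptions for the example.

```python
import hashlib


def chunk_hashes(s: str, size: int = 4) -> list:
    """Hash consecutive, non-overlapping chunks of s (the chunk size is an arbitrary choice)."""
    chunks = [s[i:i + size] for i in range(0, len(s), size)]
    return [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]


def matching_chunks(a: str, b: str, size: int = 4) -> int:
    """Count positions at which both strings have an identical chunk hash."""
    return sum(x == y for x, y in zip(chunk_hashes(a, size), chunk_hashes(b, size)))


base = "i go cinema"
for other in ["i go cinema1", "i go th cinema"]:
    print(repr(other), matching_chunks(base, other))
```

With a trailing character added, 2 of the 3 chunks still match. With "th " inserted in the middle, every chunk after the insertion point is shifted and only the first one matches, which is the significant difference described above.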
All in all, hashing is not the way to go.