0
t1 = 'This is an example data for testing minhashing' t2 = 'This is an example test data for minhashing' t3 = 'This is not related at all' t4 = t1 t5 = t3 # texts = [t1,t2,t3,t4,t5] texts = [t1,t2,t3,t4,t5] # Define Shingling function def shingle(s, l): #Generate k-tuple shingles of a string l = min(len(s), l) return([s[i:i+l] for i in range(len(s) - l + 1)]) # Declare punctuation exclude = set(string.punctuation) # Generate hash functions def hash_functions(): def hash_factory(ni): return(lambda x: hash("salt" + str(ni) + str(x) + "salt")) return [hash_factory(i) for i in range(2)] # Create list of shingle lists and Normalize text shingle_list = [shingle(''.join(ch for ch in d.lower().replace(' ','') if ch not in exclude),4) for d in texts] for x in shingle_list: print x # Generate LSH matrix LIST = [[min(fn(t) for t in tuples) for i, fn in enumerate(hash_functions())] for tuples in shingle_list] for x in LIST: print x 

Can anyone explain me what the hash function mentioned in the above snippet, it works fine but i donot understand ("salt" + str(ni) + str(x) + "salt).

2
  • Is it really the string concatenation (or str calls) that you don't understand? Or do you not understand why they're doing that, or how the lambda is used to wrap it up as a function, or something else other than what you asked? Commented Jun 11, 2015 at 23:44
  • I do not understand why the author is concatenating "salt". Commented Jun 11, 2015 at 23:50

2 Answers 2

2

it works fine but i donot understand ("salt" + str(ni) + str(x) + "salt).

This part is trivial.

First, the str functions takes any object and converts it to its string representation. For example, ni is going to be a number 0 or 1, which str will convert to the string "0" or "1".

Next, we're just concatenating four strings together: "a" + "bc" + "d" + "ef" gives you "abcdef".


I'm guessing that you're really asking is why you would do this.

When you're writing a hash function for some combination of values in terms of a simpler hash function, you want to be careful to make sure you include some kind of "salting", so your special combination of values doesn't accidentally hash to the same thing as a simple combination of the same values.

For an even simpler example, consider this class:

class Point(object): def __init__(self, x, y): self.x, self.y = x, y def __hash__(self): return hash((self.x, self.y)) 

But this means that hash((1.0, 2.0)) == hash(Point(1.0, 2.0)). Usually, you don't want that; a Point isn't the same thing as a tuple, it's a type with its own (very thin, but not nonexistent) semantics. So, you stick some extra value, called a "salt", into the hash. For example:

class Point(object): def __init__(self, x, y): self.x, self.y = x, y def __hash__(self): return hash((type(self), self.x, self.y)) 

And now, hash((1.0, 2.0)) != hash(Point(1.0, 2.0)).

Notice that this is distinct from, but not unrelated to, other important reasons for salting hashes (e.g., in cryptographic hashes, you can negotiate some shared random nonce to use as a salt, to make sure that nobody can reproduce the same hash results unless they have the negotiated salt, and you can use key exchange protocols to make sure they don't have it).


However, it's worth mentioning that this is a very silly hash function.

First, it's simpler, more robust, and more efficient to hash a tuple of values than to has a concatenated string. Most likely this code was written for some other language that didn't have a generic hash function, only a hash_string function.

Second, the only reason you'd want to both prepend and append a salt, instead of just one or the other, is if you didn't trust the hash function you're relying on to treat the parts of its values uniformly. And really, if you can't trust that, putting the salt on both ends doesn't help that much—and may actually hurt. (For example, if your hash_string undervalues everything after the first few characters, then prepending the salt does nicely avoid collision with unsalted values while appending it wouldn't—but it also means you're pushing 4 more actual characters out of the over-valued parts of the input, so your salted hashes are going to be even more badly distributed than normal hashes. If you really can't trust a hash function, you can't build a more complex hash function on top of it; you have to build your own.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you for taking time to explain in detail.
1

The part you highlight

def hash_functions(): def hash_factory(ni): return(lambda x: hash("salt" + str(ni) + str(x) + "salt")) return [hash_factory(i) for i in range(2)] 

Is a function factory. Instead of hash_factory returning a value, it returns a function whose, erm, function depends on the value you pass it. Imagine something like:

def greeting_factory(greeting): return lambda name: greeting + name 

You could use this greeting to create other functions, for example:

say_hola = greeting_factory("Hola, ") say_bonjour = greeting_factory("Bonjour, ") say_hello = greeting_factory("Hello, ") say_howdy = greeting_factory("Howdy, ") 

Then use each of THOSE functions, which all expect one argument (the name of the person to say hello to)

>>> say_hola("Juan") Hola, Juan >>> say_bonjour("Jacques") Bonjour, Jacques >>> say_hello("James") Hello, James >>> say_howdy("pardner") Howdy, pardner 

In this case, it's using a function factory to build a couple different hashing functions, then returning the list. This is identical to:

def ways_to_say_hello(): def greeting_factory(greeting): return lambda name: greeting + name return [greeting_factory(greet) for greet in ['Hola, ', 'Bonjour, ', 'Hello, ', 'Howdy, ']] >>> for hello_func in ways_to_say_hello(): ... hello_func("Adam Smith") Hola, Adam Smith Bonjour, Adam Smith Hello, Adam Smith Howdy, Adam Smith 

4 Comments

Thank you for taking the time to explain. I am not much familiar with hash, I dont understand why the author is concatenating "salt". I understand str(ni) would change the hash value, but why is "salt" needed?
@Siddarth it's not. This is a terrible example of a hashing function. It also does silly things like for i, fn in enumerate(hash_functions()) when i is never used. For reasons why salting a hash is VITALLY important for more critical hashing functions (bcrypt, SHA-512, etc), you should review security.stackexchange.com/questions/51959/… which would be off-topic for stack exchange.
@Siddarth Nota Bene: if you're hashing something important, DON'T ROLL YOUR OWN HASH FUNCTION. If you're not hashing anything important, use python's built-in hash function (i.e. hash(t1)). There's never an actual reason to write your own hashing function unless you're not happy with the industry standards and have the experience, education, and have done the necessary research to improve on it.
Thank you, i will look into it. I now have a basic understanding of why he appending "salt"

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.