Lambda comes from the Lambda Calculus and refers to anonymous functions in programming.
Why is this cool? It allows you to write quick throwaway functions without naming them. It also provides a nice way to write closures. With that power you can do things like this:
Python
def adder(x):
    return lambda y: x + y

add5 = adder(5)
add5(1)
#=> 6
As you can see from the Python snippet, the function adder takes an argument x and returns an anonymous function, or lambda, that takes another argument y. Because the lambda closes over x, calling adder lets you create functions from functions. This is a simple example, but it should convey the power lambdas and closures have.
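Because they need no name, lambdas also shine as quick throwaway arguments to higher-order functions. A minimal Python sketch (the list of pairs is made up for illustration):

# Sort a list of (name, age) pairs by age, passing a throwaway lambda as the key
people = [("Alice", 31), ("Bob", 25), ("Carol", 42)]
people.sort(key=lambda person: person[1])
print(people)  #=> [('Bob', 25), ('Alice', 31), ('Carol', 42)]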
Examples in other languages
Perl 5
sub adder {
    my ($x) = @_;
    return sub {
        my ($y) = @_;
        $x + $y;
    };
}

my $add5 = adder(5);
print &$add5(1) == 6 ? "ok\n" : "not ok\n";
JavaScript
var adder = function (x) {
    return function (y) {
        return x + y;
    };
};

var add5 = adder(5);
add5(1) == 6; // true
JavaScript (ES6)
const adder = x => y => x + y;

const add5 = adder(5);
add5(1) == 6; // true
Scheme
(define adder
  (lambda (x)
    (lambda (y)
      (+ x y))))

(define add5
  (adder 5))

(add5 1)
;=> 6
C# (3.0 or higher)
// using System; using System.Diagnostics;
Func<int, Func<int, int>> adder =
    (int x) => (int y) => x + y; // `int` declarations optional

Func<int, int> add5 = adder(5);
var add6 = adder(6); // using implicit typing
Debug.Assert(add5(1) == 6);
Debug.Assert(add6(-1) == 5);

// Closure example
int yEnclosed = 1;
Func<int, int> addWithClosure =
    x => x + yEnclosed;
Debug.Assert(addWithClosure(2) == 3);
Swift
func adder(_ x: Int) -> (Int) -> Int {
    return { y in x + y }
}

let add5 = adder(5)
add5(1)
//=> 6
PHP
$a = 1;
$b = 2;

// Arrow functions (PHP 7.4+) capture $a and $b from the enclosing scope by value.
$lambda = fn () => $a + $b;
echo $lambda(); //=> 3
Haskell
(\x y -> x + y)
Java (8 or higher)
import java.util.function.Predicate;

// The following is an example of a Predicate:
// a functional interface that takes an argument
// and returns a boolean primitive type.
Predicate<Integer> pred = x -> x % 2 == 0; // tests whether the parameter is even
boolean result = pred.test(4); // true
Lua
adder = function(x)
    return function(y)
        return x + y
    end
end

add5 = adder(5)
print(add5(1) == 6) -- true
Kotlin
val pred = { x: Int -> x % 2 == 0 }
val result = pred(4) // true
Ruby
Ruby is slightly different in that you cannot call a lambda using exactly the same syntax as calling a method (you invoke it with [], .call, or .() instead), but it still has lambdas.
def adder(x)
  lambda { |y| x + y }
end

add5 = adder(5)
add5[1] == 6
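# Equivalently, a lambda can be invoked with .call or .():
add5.call(1) == 6
add5.(1) == 6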
Ruby being Ruby, there is a shorthand for lambdas, so you can define adder this way:
def adder(x)
  -> y { x + y }
end
R
adder <- function(x) {
  function(y) x + y
}

add5 <- adder(5)
add5(1)
#> [1] 6
JavaScript has two number types: Number and BigInt.

The most frequently-used number type, Number, is a 64-bit floating point IEEE 754 number.

The largest exact integral value of this type is Number.MAX_SAFE_INTEGER, which is:
- 2^53 - 1, or
- +/- 9,007,199,254,740,991, or
- nine quadrillion seven trillion one hundred ninety-nine billion two hundred fifty-four million seven hundred forty thousand nine hundred ninety-one
To put this in perspective: one quadrillion bytes is a petabyte (or one thousand terabytes).
"Safe" in this context refers to the ability to represent integers exactly and to correctly compare them.
From the spec:
Note that all the positive and negative integers whose magnitude is no greater than 2^53 are representable in the Number type (indeed, the integer 0 has two representations, +0 and -0).
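A quick check of these values, runnable in any modern JavaScript engine:

console.log(Number.MAX_SAFE_INTEGER)                  // 9007199254740991
console.log(Number.MAX_SAFE_INTEGER === 2 ** 53 - 1)  // true
console.log(Number.isSafeInteger(2 ** 53 - 1))        // true
console.log(Number.isSafeInteger(2 ** 53))            // false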
To safely use integers larger than this, you need to use BigInt, which has no upper bound.
Note that the bitwise operators and shift operators operate on 32-bit integers, so in that case the max safe integer is 2^31 - 1, or 2,147,483,647.
const log = console.log
var x = 9007199254740992
var y = -x
log(x == x + 1) // true !
log(y == y - 1) // also true !
// Arithmetic operators work, but bitwise/shifts only operate on int32:
log(x / 2) // 4503599627370496
log(x >> 1) // 0
log(x | 1) // 1
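For exact arithmetic past this limit, here is a brief sketch of the same comparison using BigInt (note the n suffix on BigInt literals):

const big = 9007199254740992n
log(big === big + 1n)  // false — BigInt stays exact
// A Number literal one above the limit silently rounds back down:
log(9007199254740993 === 9007199254740992)  // true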
Technical note on the subject of the number 9,007,199,254,740,992: There is an exact IEEE-754 representation of this value, and you can assign and read this value from a variable, so for very carefully chosen applications in the domain of integers less than or equal to this value, you could treat this as a maximum value.
In the general case, you must treat this IEEE-754 value as inexact, because it is ambiguous whether it is encoding the logical value 9,007,199,254,740,992 or 9,007,199,254,740,993.
Best Answer
I assume entropy was mentioned in the context of building decision trees.
To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names each labeled with either m or f, we want to learn a model that fits the data and can be used to predict the gender of a new, unseen first name.

The first step is deciding what features of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. After feature extraction, the data looks like a table of names with these feature columns, each row labeled m or f.
The goal is to build a decision tree, where each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node which contains the class prediction (m or f).

So if we run the name Amro down such a tree, we start by testing "is the length < 7?", and the answer is yes, so we go down that branch. Following the branch, the next test "is the number of vowels < 3?" again evaluates to true. This leads to a leaf node labeled m, and thus the prediction is male (which I happen to be, so the tree predicted the outcome correctly).

The decision tree is built in a top-down fashion, but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible children nodes (i.e. nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).
This measure of purity is called information. It represents the expected amount of information that would be needed to specify whether a new instance (a first name) should be classified male or female, given the examples that reached the node. We calculate it based on the number of male and female classes at the node.
Entropy, on the other hand, is a measure of impurity (the opposite). It is defined for a binary class with values a/b as:

Entropy = -p(a) * log(p(a)) - p(b) * log(p(b))

This binary entropy function (for a random variable that can take one of two values) reaches its maximum when the probability is p = 1/2, meaning that p(X=a) = 0.5 or similarly p(X=b) = 0.5: a 50%/50% chance of being either a or b (uncertainty is at a maximum). The entropy function is at its minimum of zero when the probability is p = 1 or p = 0 with complete certainty (p(X=a) = 1 or p(X=a) = 0 respectively, the latter implying p(X=b) = 1).

Of course the definition of entropy can be generalized for a discrete random variable X with N outcomes (not just two):

H(X) = -SUM[i=1..N] p(x_i) * log(p(x_i))
(the log in the formula is usually taken as the logarithm to the base 2)

Back to our task of name classification, let's look at an example. Imagine that at some point during the process of constructing the tree, we were considering the following split:

     ends-vowel
      [9m,5f]          <-- the [..,..] notation shows the class
    /          \           distribution of instances at a node
  =1            =0
[3m,4f]      [6m,1f]
As you can see, before the split we had 9 males and 5 females, i.e. P(m) = 9/14 and P(f) = 5/14. According to the definition of entropy, the entropy before the split is:

Entropy_before = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.9403

Next we compare it with the entropy computed after considering the split, by looking at the two child branches. In the left branch of ends-vowel=1, we have:

Entropy_left = -(3/7) * log2(3/7) - (4/7) * log2(4/7) = 0.9852

and in the right branch of ends-vowel=0, we have:

Entropy_right = -(6/7) * log2(6/7) - (1/7) * log2(1/7) = 0.5917

We combine the left/right entropies using the number of instances down each branch as a weight factor (7 instances went left, and 7 instances went right), and get the final entropy after the split:

Entropy_after = (7/14) * Entropy_left + (7/14) * Entropy_right = 0.7885

Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature:

Information_Gain = Entropy_before - Entropy_after = 0.9403 - 0.7885 = 0.1518
You can interpret the above calculation as follows: by doing the split with the ends-vowel feature, we were able to reduce uncertainty in the sub-tree prediction outcome by a small amount of 0.1518 (measured in bits as units of information).

At each node of the tree, this calculation is performed for every feature, and the feature with the largest information gain is chosen for the split in a greedy manner (thus favoring features that produce pure splits with low uncertainty/entropy). This process is applied recursively from the root node down, and stops when a leaf node contains instances all having the same class (no need to split it further).
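The whole calculation is easy to reproduce. Here is a minimal Python sketch of it (the entropy helper and the [3m,4f]/[6m,1f] counts just restate the worked example above; this is not code from any particular library):

from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])                    # parent node: 9 males, 5 females -> ~0.9403
left = entropy([3, 4])                      # ends-vowel=1 branch -> ~0.9852
right = entropy([6, 1])                     # ends-vowel=0 branch -> ~0.5917
after = (7 / 14) * left + (7 / 14) * right  # weighted by instances per branch -> ~0.7885

print(round(before - after, 4))             # information gain: 0.1518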
Note that I skipped over some details which are beyond the scope of this post, including how to handle numeric features, missing values, overfitting, pruning trees, etc.