Python – Sequential k-means clustering using scikit-learn

cluster-analysis, machine-learning, python, scikit-learn

Is there a way to perform sequential k-means clustering using scikit-learn? I can't seem to find a proper way to add new data, without re-fitting all the data.

Thank you

Best Answer

scikit-learn's KMeans class has a predict method that, given some (new) points, determines which of the clusters these points would belong to. Calling this method does not change the cluster centroids.

If you do want the centroids to be changed by the addition of new data, i.e. you want to do clustering in an online setting, use the MiniBatchKMeans estimator and its partial_fit method.
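
A minimal sketch of the online approach described above, assuming the data arrives in batches. The two-blob data and the make_batch helper are hypothetical and exist only for illustration; MiniBatchKMeans, partial_fit, predict and cluster_centers_ are the scikit-learn names.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)


def make_batch(n=50):
    """Hypothetical data source: two well-separated 2-D blobs per batch."""
    a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(n, 2))
    b = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(n, 2))
    return np.vstack([a, b])


model = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

# Each partial_fit call updates the centroids using only the new batch;
# earlier batches are not needed again.
for _ in range(5):
    model.partial_fit(make_batch())

centers_before = model.cluster_centers_.copy()

# predict only assigns points to the existing clusters.
labels = model.predict(np.array([[0.1, -0.2], [4.8, 5.3]]))
centers_after = model.cluster_centers_.copy()
```

Comparing centers_before with centers_after should show that predict leaves the centroids untouched, while every further partial_fit call moves them.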

Related Solutions

Python – How to split a list into evenly sized chunks

Here's a generator that yields the chunks you want:

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

import pprint
pprint.pprint(list(chunks(list(range(10, 75)), 10)))
[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]

If you're using Python 2, you should use xrange() instead of range():

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in xrange(0, len(lst), n):
        yield lst[i:i + n]

You can also use a list comprehension instead of writing a function, though it's a good idea to encapsulate operations like this in named functions so that your code is easier to understand. Python 3:

[lst[i:i + n] for i in range(0, len(lst), n)]

Python 2 version:

[lst[i:i + n] for i in xrange(0, len(lst), n)]

Python – Using global variables in a function

You can use a global variable within other functions by declaring it as global within each function that assigns a value to it:

globvar = 0

def set_globvar_to_one():
    global globvar    # Needed to modify global copy of globvar
    globvar = 1

def print_globvar():
    print(globvar)     # No need for global declaration to read value of globvar

set_globvar_to_one()
print_globvar()       # Prints 1

Since global variables have a long history of introducing bugs (in every programming language), Python wants to make sure that you understand the risks by forcing you to explicitly use the global keyword.
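
To illustrate why the declaration matters, here is a small sketch of what happens when it is left out. The names counter, increment_without_global and increment_with_global are made up for this example.

```python
counter = 0


def increment_without_global():
    # The assignment makes counter local to this function, so reading it
    # on the right-hand side fails before any value has been bound.
    counter = counter + 1


def increment_with_global():
    global counter  # Rebind the module-level name instead
    counter = counter + 1


error_name = None
try:
    increment_without_global()
except UnboundLocalError as exc:
    error_name = type(exc).__name__

increment_with_global()
```

After this runs, the first call has raised UnboundLocalError and left the module-level counter unchanged, while the second call has incremented it.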

See other answers if you want to share a global variable across modules.
