Thursday 18 December 2014

Competency 6.1

Competency 6.1: Learn how to engineer both features and training labels
  
Feature Engineering
Feature engineering is the art of creating predictor variables. A model will not be good if its features (predictors) are not good. The practice relies on lore rather than on well-known, validated principles. It is the “massaging” (transformation) or “finishing touch” (organisation) of raw data into features that better represent the underlying problem, so that the model can learn a solution from the data (or make more accurate predictions on unseen data).

We need good features to highlight the constructs inherent in the data. With good features, one can choose the “wrong” model and still get “good” results. Good features give us the flexibility to use less sophisticated models that are faster to run, easier to maintain and easier to understand. They also bring us closer to the underlying problem, because the features themselves illuminate it.

The big idea is how we can take the voluminous, ill-formed and yet under-specified data that we now have in education and shape it into a reasonable set of variables in an efficient and predictive way.

Process:
  1. Brainstorming features - IDEO tips for brainstorming
  2. Deciding what features to create - trade-off between effort and usefulness of feature
  3. Creating the features - Excel, OpenRefine, Distillation code
  4. Studying the impact of features on model goodness
  5. Iterating on features if useful - try close variants and test
  6. Go to 3 (or 1)
Feature engineering can over-fit, so iterate with cross-validation and test on held-out or newly collected data. Thinking carefully about our own variables is likely to yield better results than using pre-existing variables from a standard set.
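Steps 3 to 5 of the process can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the student data is synthetic, the column names (total_time, attempts, score) are hypothetical, and a one-variable least-squares line stands in for whatever model is actually being used. It compares a raw feature with an engineered one by k-fold cross-validated error.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx

def cv_mse(xs, ys, k=5):
    """Mean squared error averaged over k folds (every k-th row is held out)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Synthetic, hypothetical data: the score depends on time *per attempt*,
# but the raw log only records total time and the number of attempts.
rng = random.Random(0)
rows = []
for _ in range(200):
    attempts = rng.randint(1, 10)
    secs_per_attempt = rng.uniform(5, 60)
    total_time = attempts * secs_per_attempt
    score = 100 - 1.2 * secs_per_attempt + rng.gauss(0, 5)
    rows.append((total_time, attempts, score))

scores = [s for _, _, s in rows]
raw_feature = [t for t, _, _ in rows]             # step 3: feature as logged
engineered_feature = [t / a for t, a, _ in rows]  # step 3: engineered variant

# Step 4: study the impact of each feature on model goodness.
mse_raw = cv_mse(raw_feature, scores)
mse_eng = cv_mse(engineered_feature, scores)
print(f"raw feature CV MSE:        {mse_raw:.1f}")
print(f"engineered feature CV MSE: {mse_eng:.1f}")
```

Because the error is measured on rows the model never saw during fitting, a feature that merely memorises the training data will not look better here. The same loop can be rerun for each close variant in step 5.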
There are different approaches to engineering features and training labels. Here are some classical ones:
Feature Extraction
An automatic process that reduces the dimensionality of the observations into a smaller set of features that can be modelled.
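
As one possible illustration of feature extraction, the sketch below uses principal component analysis (a technique not named in these notes, chosen here only as a common example) to collapse three correlated columns into a single derived feature. The data values are made up, and the power-iteration routine is a simplified stand-in for what a statistics library would provide.

```python
from statistics import mean

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component.

    Uses power iteration on the sample covariance matrix. This assumes the
    starting vector of ones is not orthogonal to the leading component.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [mean(r[j] for r in rows) for j in range(n_cols)]
    centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(n_cols)]
        for i in range(n_cols)
    ]
    v = [1.0] * n_cols
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # constant data: there is no direction to find
            break
        v = [x / norm for x in w]
    # Project each centered observation onto the direction: one value per row.
    scores = [sum(c[j] * v[j] for j in range(n_cols)) for c in centered]
    return v, scores

# Hypothetical observations: three strongly correlated measurements per student.
rows = [
    (2.0, 4.0, 6.5),
    (4.0, 8.1, 12.0),
    (6.0, 12.0, 17.8),
    (8.0, 15.9, 24.0),
    (10.0, 20.0, 30.2),
]
direction, extracted_feature = first_principal_component(rows)
print(direction)
print(extracted_feature)  # one extracted value replaces three raw columns
```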

Feature Selection
An automatic process that selects the subset of features most relevant to the problem (e.g. by scoring each feature).
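
A minimal sketch of selection by scoring, assuming the score is the absolute Pearson correlation between each feature and the label (many other scoring rules are possible). The feature names and values are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_top_k(features, label, k):
    """Rank feature names by |correlation with the label| and keep the best k."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], label)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical data: 1 = student answered correctly, 0 = incorrectly.
label = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "hints_requested": [0, 1, 0, 3, 4, 2, 0, 5],
    "seconds_on_task": [30, 45, 20, 50, 40, 60, 35, 55],
    "student_id_digit": [3, 7, 1, 3, 9, 1, 7, 5],
}
selected = select_top_k(features, label, k=2)
print(selected)
```

Scoring features one at a time is cheap, but it ignores interactions between features, so the selected subset is still worth checking against model performance.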

Manual Construction
A manual process of crafting features so that the information they carry is exposed to the model (e.g. combining or decomposing existing features to create new ones).
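
The sketch below shows both moves on a single, made-up log record: decomposing a timestamp into simpler parts, and combining counts into ratios. The field names (started_at, attempts, correct, seconds) are assumptions for illustration only.

```python
from datetime import datetime

def build_features(record):
    """Hand-craft features from one raw log record (hypothetical schema)."""
    started = datetime.fromisoformat(record["started_at"])
    attempts = record["attempts"]
    return {
        # Decomposition: split one timestamp into parts a model can use.
        "hour_of_day": started.hour,
        "is_weekend": started.weekday() >= 5,  # Monday is 0, Sunday is 6
        # Combination: merge raw counts into ratios.
        "accuracy": record["correct"] / attempts if attempts else 0.0,
        "seconds_per_attempt": record["seconds"] / attempts if attempts else 0.0,
    }

record = {
    "started_at": "2014-12-18T21:30:00",
    "attempts": 4,
    "correct": 3,
    "seconds": 120,
}
features = build_features(record)
print(features)
```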
