Your aim is to build and publish a cloud-based property price application. The application will predict a property's current worth from past sales data. You have heard that Azure Machine Learning (AML) can help, so you want to gain an understanding of the AML *terminology* as you go along. Your application will request property fields as *input*. These include: type (terraced, semi, detached), number of bedrooms, bathrooms and living rooms, postcode prefix, business or residential use, and so on.

The first step is to obtain raw data for past property sales via the land registry or a real estate company; if possible, every transaction over the last eighteen months. The raw data should include all the input fields, plus many others, and of course the actual sale price. Each field in AML is known as a *feature*.

Let's imagine we have the following very simplified raw data in CSV format:

```
id,mainrooms,bedrooms,bathrooms,value
1,1,1,1,250000
1,1,1,1,250000
2,2,2,1,290000
3,2,3,1,340000
4,0,4,2,400000
```

Next we create *prepared* data by pre-processing the raw data. The pre-processing uses machine learning *modules* to de-duplicate rows, delete rows with missing data, and delete rows with invalid data. For the above data we would therefore discard two rows: the duplicate second row with id=1, and the row with id=4 (a mainrooms value of 0 is invalid).

Module processing can also include changing the type of the data, e.g. from a string value to a numeric value. Any column *features* that do not affect the worth of the property should also be removed, and any necessary mathematical operations can be applied to the data.
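To make the pre-processing concrete, here is a minimal sketch in plain Python over the sample CSV above (in AML this step would be built from drag-and-drop modules rather than hand-written code; the invalid-data rule that a property must have at least one main room is an assumption for this example):

```python
import csv
import io

# the simplified raw data from above
RAW_CSV = """id,mainrooms,bedrooms,bathrooms,value
1,1,1,1,250000
1,1,1,1,250000
2,2,2,1,290000
3,2,3,1,340000
4,0,4,2,400000
"""

def prepare(raw_csv):
    rows, seen = [], set()
    for record in csv.DictReader(io.StringIO(raw_csv)):
        # delete rows with missing data
        if any(v is None or v == "" for v in record.values()):
            continue
        # change type: string values to numeric values
        row = {k: int(v) for k, v in record.items()}
        # delete rows with invalid data (assumed rule: at least one main room)
        if row["mainrooms"] < 1:
            continue
        # de-duplicate rows
        key = tuple(row.values())
        if key in seen:
            continue
        seen.add(key)
        rows.append(row)
    return rows

prepared = prepare(RAW_CSV)
# three rows survive: ids 1, 2 and 3
```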

The prepared data is then split 80:20; the 20% is kept back as *test data*, and the 80% is now known as the *training data*.
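The split itself is simple. A sketch, assuming the prepared data is a list of rows (shuffling first avoids any ordering bias in the source data):

```python
import random

prepared = [{"id": i} for i in range(1, 11)]  # ten illustrative rows

random.seed(42)       # fixed seed so the split is repeatable
random.shuffle(prepared)

cut = int(len(prepared) * 0.8)
training_data = prepared[:cut]   # 80%: used to train the model
test_data = prepared[cut:]       # 20%: held back for evaluation
```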

Next is to create an *experiment*. The data scientist chooses a single algorithm from a set of pre-prepared machine learning algorithms. In this scenario the scientist is looking for an algorithm that, when executed, can analyse the *training data* and return the best numerical combination of *features*, i.e. the combination whose predicted values (also known as *target* values or *labels*) come closest to each actual value. Below is an extremely simplified algorithm that demonstrates this:

*predicted value = a + (b * mainrooms) + (c * bedrooms) + (d * bathrooms)*

Hence the algorithm uses the features of the training data (mainrooms, bedrooms, bathrooms, etc.) to determine the best values of a, b, c and d, the *weightings*.

Let's consider if machine learning internally determined the following values: a=150,000, b=20,000, c=30,000, d=40,000. Then for the row with id=3:

*predicted value = 150,000 + (20,000 * 2) + (30,000 * 3) + (40,000 * 1)*

The predicted value would be £320,000, compared to the actual value of £340,000.
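The same calculation, written as a function using the weightings machine learning determined above:

```python
def predict(mainrooms, bedrooms, bathrooms):
    # weightings determined by training: a, b, c, d
    a, b, c, d = 150_000, 20_000, 30_000, 40_000
    return a + (b * mainrooms) + (c * bedrooms) + (d * bathrooms)

# row with id=3 has mainrooms=2, bedrooms=3, bathrooms=1
predicted = predict(2, 3, 1)  # → 320000, vs the actual value of 340000
```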

When a machine learning algorithm execution completes, the candidate combination is internally saved as code and is known as the *model*. This process is usually repeated numerous times until the best model is found; this is known as *training the model*. For each saved *model* we can use the test data to evaluate its statistical accuracy.
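To illustrate what "repeatedly improving a candidate" means, here is a toy training loop that searches for the weightings a, b, c, d by gradient descent over the three prepared rows. This is only a sketch of the idea; AML's built-in algorithms are far more sophisticated, and the learning rate and iteration count here are arbitrary choices for this tiny data set:

```python
training_rows = [  # (mainrooms, bedrooms, bathrooms, actual value)
    (1, 1, 1, 250_000),
    (2, 2, 1, 290_000),
    (2, 3, 1, 340_000),
]

def predict(weights, row):
    a, b, c, d = weights
    mainrooms, bedrooms, bathrooms, _ = row
    return a + b * mainrooms + c * bedrooms + d * bathrooms

def mse(weights):
    """Mean squared error of the candidate weightings over the training data."""
    return sum((predict(weights, r) - r[3]) ** 2 for r in training_rows) / len(training_rows)

weights = [0.0, 0.0, 0.0, 0.0]   # starting candidate: all weightings zero
learning_rate = 0.01
for _ in range(50_000):
    gradients = [0.0] * 4
    for r in training_rows:
        error = predict(weights, r) - r[3]
        features = (1, r[0], r[1], r[2])  # the 1 is for the constant term a
        for i in range(4):
            gradients[i] += 2 * error * features[i] / len(training_rows)
    # nudge each weighting a little in the direction that reduces the error
    weights = [weights[i] - learning_rate * gradients[i] for i in range(4)]
# the error shrinks with each pass; the final weights are our candidate model
```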

Hence our *candidate model* would be:

predicted value = 150000 + (mainrooms * 20000) + (bedrooms * 30000) + (bathrooms * 40000)

To evaluate the *candidate model* we can now add a *score model*. This takes the 20% test data, applies it to the *candidate model*, and produces an output from which we can determine the statistical accuracy of our *candidate model* and compare predicted values to actual values.
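A sketch of what scoring measures: run the candidate model over held-back test rows and report how far the predictions land from the actual values. The two test rows below are hypothetical, and mean absolute error is just one of several accuracy statistics a score module can produce:

```python
def predict(mainrooms, bedrooms, bathrooms):
    # the candidate model from above
    return 150_000 + (20_000 * mainrooms) + (30_000 * bedrooms) + (40_000 * bathrooms)

test_rows = [  # (mainrooms, bedrooms, bathrooms, actual value) - hypothetical
    (2, 3, 1, 340_000),
    (1, 2, 2, 310_000),
]

errors = [abs(predict(m, b, ba) - actual) for m, b, ba, actual in test_rows]
mean_absolute_error = sum(errors) / len(errors)
# first row: |320,000 - 340,000| = 20,000; second row: |310,000 - 310,000| = 0
# so the mean absolute error here is 10,000
```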

In our house example we could have started with any of the *regression* machine learning algorithms. These could give us good results, as property prices might be linearly proportional to the values of the property *features*.

Finally, when we have produced our best model we can deploy it to Azure as a web API.

The app then prompts for the input features and submits them to the API; the model returns a predicted property value, which the app displays.
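A sketch of how the app might call the deployed service. The endpoint URL, API key, and JSON request shape below are all hypothetical; the exact schema depends on how the web service was published:

```python
import json
import urllib.request

ENDPOINT = "https://example.azureml.net/score"  # hypothetical URL
API_KEY = "your-api-key-here"                   # hypothetical key

def build_request(mainrooms, bedrooms, bathrooms):
    # assumed payload shape; check your service's published schema
    payload = {
        "Inputs": {
            "mainrooms": mainrooms,
            "bedrooms": bedrooms,
            "bathrooms": bathrooms,
        }
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )

req = build_request(2, 3, 1)
# response = urllib.request.urlopen(req)  # uncomment with a real endpoint
```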

The whole AML cycle can be demonstrated in the diagram below:

*[Diagram: the AML cycle, taken from David Chappell's Introduction for Technical Professionals]*

*[Diagram: the same design, from an online tutorial]*

Further reading:

For this particular problem we have been looking to produce a continuous (scalar) value, such as an amount of money, and so we could choose algorithms that fit this. These include: Bayesian Linear Regression, Boosted Decision Tree Regression, and Neural Network Regression. The choice of algorithm is crucial to gaining a satisfactory result.

A different problem might be to produce two or more *categories* of result. An email spam filtering *model* will take an email as input and return one of two values: spam or not spam. A *model* that determines a dog's breed will return one of many possible results. Both of these problems require *category*-based (classification) algorithms such as: Multiclass Decision Jungle, Two-Class Boosted Decision Tree, and One-vs-All Multiclass.

Another problem might be that we have many bitmap images of petri dishes containing cultures that have been impregnated with potential new antibiotics. We might want to place each dish into *clusters* (not effective, slightly effective, very effective) using the K-Means Clustering algorithm.
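To show the clustering idea at its simplest, here is a minimal one-dimensional K-Means sketch. It assumes each dish's image has already been reduced to a single "effectiveness" score between 0 and 1 (the scores below are hypothetical); AML provides K-Means Clustering as a ready-made module that works on the full feature data:

```python
scores = [0.05, 0.10, 0.12, 0.48, 0.52, 0.55, 0.90, 0.93]
k = 3  # not effective, slightly effective, very effective

# deterministic starting centroids: lowest score, mean score, highest score
centroids = [min(scores), sum(scores) / len(scores), max(scores)]

for _ in range(10):  # a few iterations suffice for this tiny data set
    # assign each score to its nearest centroid
    clusters = [[] for _ in range(k)]
    for s in scores:
        nearest = min(range(k), key=lambda i: abs(s - centroids[i]))
        clusters[nearest].append(s)
    # move each centroid to the mean of the scores assigned to it
    centroids = [
        sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
    ]
# clusters[0] → not effective, clusters[1] → slightly effective,
# clusters[2] → very effective
```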