Demand forecasting with the regression model
Introduction
This tutorial will demonstrate how to use a deep learning-based regression model in BLOCKS. A regression model tries to determine the relationship between various input features and a results variable (called the “label”). This type of model is often used to predict sales, customer numbers, and the like.
For the example in this tutorial, we’ll create a regression model in BLOCKS that uses weather and daytime-minutes data for the Tokyo region as input features to predict electricity demand.
General overview
This tutorial will show how to create a machine learning system in BLOCKS using the DataEditor and Flow Designer services. The basic steps will be as follows:
- Prepare the initial data in Google Sheets and download it as a CSV file.
- Upload the CSV file into the DataEditor and split it into training data and testing data.
- Run a training from the DataEditor:
BLOCKS will “learn” the relationship between the input features (weather/daytime minutes data) and the label (electricity usage).
The results that we get from the training are called a model or trained model.
- Make electricity demand predictions from a Flow Designer.
Testing a regression model
We’ll do the following to train and use a regression model to predict Tokyo electricity demand:
- Prepare weather, daytime minutes, and electricity usage data for the Tokyo region as a CSV file.
- Use the DataEditor to split the data into training data and testing data.
- Train the model in the DataEditor.
- Prepare a Flow Designer.
- Create a processing flow for making predictions.
- Execute the flow.
- Check the results of the prediction in the DataEditor.
Good, properly formatted data is essential to machine learning and makes it possible to train models and make predictions. As such, the first step will be to gather and process our data to prepare it for use in machine learning.
We recommend using Google Chrome for this tutorial. BLOCKS also supports Firefox, but some steps in this tutorial may be slightly different if you aren’t using Google Chrome.
Preparing the data as a CSV file
The first step is to prepare the data we’ll use to train and test our regression model. We’ll prepare the initial data as a CSV file (UTF-8, without BOM).
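If your source file has a byte order mark (BOM) or uses a different encoding, you can normalize it before uploading. Here’s a minimal Python sketch (the file names are placeholders):

```python
# Re-save a CSV as UTF-8 without a BOM.
# The "utf-8-sig" codec transparently strips a BOM if one is present.
with open("input.csv", "r", encoding="utf-8-sig") as src:
    text = src.read()

with open("output.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(text)
```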
The training data will consist of the following information for the Tokyo region:
- High temperature
- Low temperature
- Sunlight hours
- Average humidity
- Daytime minutes
- Electricity usage
During the training, BLOCKS will “learn” the relationship between the weather/daytime-minutes data and electricity usage.
We refer to the input variables (high temp, low temp, sunlight hours, average humidity, and daytime minutes) as features. We refer to the dependent variable (electricity usage) as the label or results variable.
During the testing step, we’ll use feature data not used to train the model to predict electricity usage.
The data for testing the model should contain a column named key and the feature data. The key column contains values to identify each example (1 row) of features. Each value within the key column must be unique. In this example, we’ll use dates as our keys.
The initial data for this tutorial comes from the following sources:
- High temperature, low temperature, sunlight hours, and average humidity data: Japan Meteorological Agency
- Daytime minutes data: National Astronomical Observatory of Japan
- Electricity usage data: Tokyo Electric Power Company Holdings (TEPCO)
Note: The National Astronomical Observatory of Japan calculates daytime as the number of minutes between sunrise and sunset.
The time period for the data is shown in the following chart:
Data type | Time period
---|---
Training data | January 1, 2016 through November 30, 2018 (2 years, 11 months)
Testing data | December 1, 2018 through December 31, 2018 (1 month)
Note: In an actual business use case, you would likely want to predict for future dates. However, this tutorial will demonstrate the testing step you would take before using your model in a business scenario. When testing a model, you predict for past data that wasn't used to train the model. By doing this, you can compare the results of the prediction with the actual data to evaluate the usefulness of the model.
We’ve prepared the feature data you’ll need for this tutorial as a CSV file that you can download from the following link:
Data | Explanation
---|---
Sample weather data (weather_daytimeminutes.csv) | A CSV file containing weather data from the Japan Meteorological Agency and daytime minutes data from the National Astronomical Observatory of Japan. It contains data for the Tokyo, Japan region for the dates January 1, 2016 through December 31, 2018.

Click the link in the left column to download the feature data.

Warning: The Japan Meteorological Agency and National Astronomical Observatory of Japan assume no responsibility for any activity that uses this data.
For the electricity demand data, you’ll need to download it directly from this TEPCO page by doing the following:
- Right-click 2016.
- Click Save Link As....
- Select where to save the file and click Save.
Repeat these steps for the 2017 and 2018 links.
Once you’ve finished downloading all of the data, you’ll need to combine it with the feature data into one file that can be used in BLOCKS.
There are various ways to process the data and prepare it for use in machine learning, but this tutorial will follow the steps outlined in the image below. First, we’ll use Google Sheets to combine the feature and label data into one table and then download it as a CSV file. Next, we’ll upload that file into the DataEditor, cleanse the data, and split it into separate data for training and testing the model.
We’ll combine all of our feature and label data into one table in Google Sheets formatted as shown in the following chart. Note that the headers (actual headers are shown in parentheses) can only contain numbers, letters, or underscores (_).
Note: The dates are not one of the features we’ll use to train the model, but they will be necessary when we test and evaluate it. We’ll use the dates in the testing data as the key column to identify each date’s example of data. This column must be named key.
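To make the layout concrete, the first rows of the combined table would look something like this. The names key, sunlight_hours, average_humidity, and electricity_usage are used later in this tutorial; the temperature and daytime-minutes headers and all of the values shown here are our own placeholders:

```
key,high_temperature,low_temperature,sunlight_hours,average_humidity,daytime_minutes,electricity_usage
2016-01-01,10.9,2.3,8.8,52,584,79893
2016-01-02,12.1,3.0,7.5,48,585,81204
```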
To prepare the CSV file that we will upload to BLOCKS, open the weather_daytimeminutes.csv file in Google Sheets.
Add the electricity usage data to a new sheet by doing the following:
- Click File.
- Click Import.
- Click Upload.
- Click Select a file from your device.
- Select the juyo-2016 file you downloaded.
- Click Open.
- Click Insert new sheet(s).
- Click Import data.
Repeat this process for the 2017 and 2018 files, but do the following in the Import file section:
- Click Append to current sheet.
- Click Import data.
All of the electricity demand data should now be imported into the juyo-2016 sheet. However, each file contained header rows, so delete these from each year’s data by doing the following:
- Select rows 1–3.
- Click Edit.
- Click Delete rows 1–3.
The next header rows to delete are on rows 6601–6603. You can jump to these rows quickly by doing the following:
- Click Edit and select Find and replace.
- Enter UPDATE in the Find field.
- Click Find.
- Click Done.
- Select rows 6601–6603.
- Click Edit.
- Click Delete rows 6601–6603.
Repeat the previous steps to find the next set of header rows and do the following:
- Select rows 15361–15363.
- Click Edit.
- Click Delete rows 15361–15363.
- Click the weather_daytimeminutes tab.
- Enter electricity_usage into cell G1.
- Enter =SUMIF('juyo-2016'!$A:$A,A2,'juyo-2016'!$C:$C) into cell G2.
This will calculate the total amount of electricity used on the date in cell A2 and enter it into cell G2 as shown in the following image:
You can use this same formula to calculate electricity usage for the rest of the dates by doing the following:
- Select cell G2 and double-click the square in its bottom-right corner.
This will fill in the rest of the electricity_usage column with data for each date as shown in the following image:
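If you’d rather do this aggregation outside of Google Sheets, a rough pandas equivalent might look like the following. The file names and column positions follow this tutorial; we assume the header rows have already been removed as described above, and that both tables write dates in the same format:

```python
import pandas as pd

# Hourly TEPCO data: date in column A, usage in column C.
# Depending on the file, you may also need encoding="shift_jis".
usage = pd.read_csv("juyo-2016.csv", header=None)
usage = usage.rename(columns={0: "date", 2: "electricity_usage"})

# Daily totals: the equivalent of filling down
# =SUMIF('juyo-2016'!$A:$A,A2,'juyo-2016'!$C:$C).
daily = usage.groupby("date", as_index=False)["electricity_usage"].sum()

# Attach the daily totals to the feature table by date.
features = pd.read_csv("weather_daytimeminutes.csv")
combined = features.merge(daily, left_on=features.columns[0],
                          right_on="date", how="left")
```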
Now that the data is ready, we’ll rename the sheet and file to make them more easily identifiable, then download the data as a CSV file. First, do the following:
- Click the file name (weather_daytimeminutes) at the top of the screen and rename the file to Electricity Demand.
- Click the triangle icon on the right side of the weather_daytimeminutes tab, select Rename, then rename the tab to Initial Data.
- Click File.
- Click Download as.
- Click Comma-separated values (.csv, current sheet).
This will download the data to your computer as Electricity Demand - Initial Data.csv.
You’re now done using Google Sheets for the initial data preparation. In the next section, we’ll import the CSV file into the DataEditor in BLOCKS.
Splitting the data in the DataEditor
Now that the initial data is ready, we need to cleanse it (remove missing values) and split it into separate data for training and testing the model. We can do all of this in the DataEditor in BLOCKS.
First, sign in to BLOCKS if you haven’t already. Then, do the following:
- Click the menu icon (menu) in the global navigation bar.
- Click DataEditor.
Import the data CSV file into the DataEditor by doing the following:
- Click Import.
- Click Upload.
- Select the Google Cloud Storage (GCS) location to upload the file into. We’ll use a bucket that ends with -data.
- Drag and drop the Electricity Demand - Initial Data.csv file into the field or click and select the file.
- Select the dataset that will store the data or click + to create a new dataset. We’ll use a dataset called tutorials.
- Enter a name for the table. We’ll name our table electricity_demand_data.
- Enter a name that will identify the data in the DataEditor. We’ll use the name Electricity Demand.
- Click Import.
Note: The DataEditor is a tool for visualizing and processing data that is stored in BigQuery. You don’t need to have any specialized knowledge of BigQuery to use the DataEditor. However, you do need to specify the BigQuery dataset and table that will store your data. If you are familiar with spreadsheet software, a BigQuery dataset could be compared to a workbook, and a BigQuery table to a single sheet.
Check the imported data for missing values by doing the following:
- Click Open.
- Click Table.
- Click keyboard_arrow_down.
- Click Show missing values.
An error icon (error_outline) will appear for any columns that contain missing values, as shown in the image above. For this data, the DataEditor will find missing values in the sunlight_hours and average_humidity columns.
We can’t use examples (rows) that contain missing values when training the model, so delete these rows by doing the following:
- Click Edit column for the sunlight_hours column.
- Click Missing values.
- Click Delete rows with missing values.
Repeat these steps for the average_humidity column.
The DataEditor keeps track of every edit that you make. You can view a list of these changes by doing the following:
- Click Changes.
- Confirm that the list of changes contains the following:
- Delete missing values in sunlight_hours
- Delete missing values in average_humidity
Note: You can click the × next to an item to revert that edit.
Save your changes by doing the following:
- Click the menu icon (more_horiz).
- Click Overwrite.
Now that you’ve cleansed the data of rows with missing values, split it into the training and testing data by doing the following:
- Click Edit Table.
- Click Split table.
- Click Custom.
- Select key.
- Click the slider icon.
- Adjust the slider to the left until it shows 2018-11-30. You can also enter this date into the field directly.
- Uncheck key to remove the dates from the training data.
- Uncheck electricity_usage to remove the label column from the testing data.
- Click Split.
- Click OK.
The training and testing sets are organized as shown below:
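For reference, the cleansing and splitting you just performed corresponds roughly to this pandas sketch. This is a conceptual illustration only, not what BLOCKS runs internally:

```python
import pandas as pd

# The combined CSV prepared in Google Sheets.
data = pd.read_csv("Electricity Demand - Initial Data.csv",
                   parse_dates=["key"])

# Cleanse: drop examples with missing sunlight or humidity values.
data = data.dropna(subset=["sunlight_hours", "average_humidity"])

# Split on the key date: through 2018-11-30 for training, after for testing.
is_train = data["key"] <= "2018-11-30"

# Training data keeps the label but not the key.
train = data.loc[is_train].drop(columns=["key"])

# Testing data keeps the key but not the label.
test = data.loc[~is_train].drop(columns=["electricity_usage"])
```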
In the following section, you’ll use the training data to train a regression model.
Training the model from the DataEditor
Use the training data you prepared to create a regression model by doing the following:
- Click Electricity Demand_train.
- If necessary, switch to the Table tab.
- Click Create Model.
- Click Regression (Deep Learning).
- Designate the Google Cloud Storage bucket (first time only).
- Enter the folder that will store your Deep Learning models and click Create Folder (first time only).
- Click Create.
- Click Close.
Depending on server conditions, the training should take about 4–5 hours. You can check on the training’s progress from the model list or from the model’s details screen.
Switch to the model list by doing the following:
- Click the back arrow (<).
- Click Models.
You can view a progress bar for the training in the RMSE/Accuracy column.
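Once the training finishes, this column shows the model’s RMSE (root mean squared error), which measures how far predictions typically fall from the actual values, in the same units as the label; lower is better. As a reference, with y_i the actual value and ŷ_i the prediction for example i:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```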
Clicking the training’s name (❶) will open its details page. You can also click the icon next to the progress bar (❷) to refresh it.
A value will appear in the RMSE/Accuracy column when the training finishes. In the next section, we’ll test the model by using it to make predictions in a Flow Designer.
Making predictions from a Flow Designer
This section will explain how to use a Flow Designer to make predictions.
You can also run predictions from the DataEditor; however, unlike the Flow Designer, it does not support batch predictions or automated prediction schedules.
For more details on making predictions from the DataEditor, refer to Creating models and making predictions in the DataEditor.
Preparing a Flow Designer
In this section, we’ll use the Flow Templates feature in the Flow Designer to make predictions using the trained model. If you don’t have any Flow Designers yet, create one by doing the following:
- Click the menu icon (menu) in the global navigation bar.
- Click Flow Designer.
- Click Start.
Note: If you’ve already created a Flow Designer in your project, you’ll see the Flow Designer list instead of the page shown above. In this case, you can either use an existing Flow Designer or create a new one by clicking the “Add” button if you have sufficient licenses.
Note: A message will appear if you don’t have enough licenses to create the Flow Designer. If you are an admin user in your organization, you’ll be given the option to purchase more licenses. Otherwise, you’ll be prompted to contact your organization’s admins.
- Enter a name for the Flow Designer. We’ll use the name Tutorials.
- Select the language that will be used for log messages.
- Select your time zone.
- Click Create.
- Click the name of your Flow Designer from the list to open it in a new tab.
Creating a flow for predictions
In this section, you’ll create a processing flow that can make predictions with the trained model by doing the following:
- Click Flow Templates.
- Click Numerical regression.
- Click Next.
- Enter a name for the flow. We’ll use Electricity Demand Prediction.
- Click Next.
- Select the model you trained in the DataEditor. Ours is named Electricity Demand_train.
- Click Online prediction.
- Click Next.
- Click DataEditor.
- Click the name of your test data. Ours is named Electricity Demand_test.
- Click Next.
- Click DataEditor.
- Click Register as new.
- Enter the name that will be used to identify the results data in the DataEditor. We’ll use Electricity Demand_test_result.
- Enter the dataset that will store the results. We’ll use tutorials.
- Enter the table that will store the results. We’ll use predict_electricity_demand_test_result.
- Click Next.
This will output the results to the DataEditor as Electricity Demand_test_result.
- Click Create.
This will create the flow in the current tab of the Flow Designer. Save the new flow by doing the following:
- Click Save.
Executing the flow
With the Flow ready, you can execute it to make predictions by doing the following:
- Click the menu icon (more_vert) on the right side of the Electricity Demand Prediction (Start of flow) BLOCK.
- Click Execute Flow.
- Click View logs to open the logs panel and check the status of the flow while it executes.
You can see the status of your flow in the log list on the left side of the log panel. Its status will be Running (❶) while it executes. Wait until it finishes.
The status will change to a green Finished (❶) if the flow finishes executing successfully.
Checking the results in the DataEditor
The results of the prediction will be sent to the DataEditor, so switch back to the BLOCKS tab and do the following:
- Click the menu icon (menu) in the global navigation bar.
- Click DataEditor.
- Click the name of your results data (you may need to click the reload button next to Import). Ours is named Electricity Demand_test_result.
- Click Table.
- Click View data.
The following chart explains the meaning of each column:
Name | Explanation
---|---
key | The keys you configured for the prediction data. In this example, the keys are dates from 2018/12/1–2018/12/31.
output | The value predicted for the corresponding key. In this example, this is the amount of electricity demand predicted for the corresponding date in the key column.
We can now compare the predicted values with the actual electricity usage values for the same dates.
First, you’ll need to join the predicted and actual values into one table. Before you can do this, you must convert the key column in the prediction results from the STRING type to the DATE type so that it matches the initial data’s key column. Do the following:
- Click Edit column for the key column.
- Click Change type.
- Click DATE.
- Click YYYY-MM-DD.
- Click OK.
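Conceptually, this type change parses each string like 2018-12-01 into a date value. A minimal pandas illustration of the same idea (the values are placeholders):

```python
import pandas as pd

# Prediction keys arrive as strings in YYYY-MM-DD format.
results = pd.DataFrame({"key": ["2018-12-01", "2018-12-02"]})

# Parse the STRING keys into DATE values.
results["key"] = pd.to_datetime(results["key"], format="%Y-%m-%d").dt.date
```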
Save the changes by doing the following:
- Click the menu icon (more_horiz).
- Click Overwrite.
Join together this data with the actual electricity usage data by doing the following:
- Click Edit Table.
- Click Merge table.
Select the table that contains the actual electricity usage by doing the following:
- Click Select Join Table.
- Click Electricity Demand.
The join type should be set to Inner join (INNER JOIN) by default.
- Click the checkbox next to Column name to select all of the columns under Electricity Demand_test_result.
- Click the checkbox for electricity_usage under Electricity Demand.
- Click Add.
- Click key under Electricity Demand_test_result.
- Click key under Electricity Demand.
- Click Join.
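Under the hood, this merge behaves like an inner join on the two key columns: only dates present in both tables are kept, and each prediction gets its actual usage attached. A small pandas sketch of the idea (all values invented):

```python
import pandas as pd

# Predicted values (as in Electricity Demand_test_result).
predicted = pd.DataFrame({"key": ["2018-12-01", "2018-12-02"],
                          "output": [84100.0, 82800.0]})

# Actual values (as in Electricity Demand).
actual = pd.DataFrame({"key": ["2018-12-01", "2018-12-02"],
                       "electricity_usage": [83950, 83210]})

# INNER JOIN on key: keep only dates that appear in both tables.
compare = predicted.merge(actual, on="key", how="inner")
```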
Configure the location where the merged data will be stored by doing the following:
- Enter a name that will identify the merged data in the DataEditor. We’ll use Electricity Demand Compare Actual and Predicted.
- Select the dataset. We’ll use tutorials.
- Enter the table ID. We’ll use predict_electricity_demand_compare.
- Click OK.
- Click OK.
Now that the data has been merged into one table, you can compare the actual and predicted electricity usage as a graph by doing the following:
- Click Electricity Demand Compare Actual and Predicted.
- Click Data Analysis.
Select the data to compare in the graph by doing the following:
- For the columns field, select key.
- In the rows field, click to expand output and select output (Sum).
- In the rows field, click to expand electricity_usage and select electricity_usage (Sum).
- Click the button in the center of the screen.
Switch to a line graph of the data by doing the following:
- Click Table.
- Click Line.
With a graph like this, you can easily compare the predicted and actual electricity usage and evaluate the effectiveness of your model.
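If you also want a single number to go with the graph, you can compute simple error metrics from the merged table. A sketch in pandas, assuming you’ve exported the comparison table as a CSV (the file name is our own placeholder):

```python
import pandas as pd

# The merged comparison table with output and electricity_usage columns.
compare = pd.read_csv("predict_electricity_demand_compare.csv")

errors = compare["output"] - compare["electricity_usage"]

# Root mean squared error: the typical gap, in the label's own units.
rmse = (errors ** 2).mean() ** 0.5

# Mean absolute percentage error: the average relative miss.
mape = (errors.abs() / compare["electricity_usage"]).mean() * 100

print(f"RMSE: {rmse:,.1f}  MAPE: {mape:.1f}%")
```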
However, judging whether or not your model is accurate enough depends on how you intend to use it. The example in this tutorial isn’t intended for any real use case, so we can’t say if it’s accurate enough or not. However, we will mention a few ways one could improve this model in the Summary section below.
With that, we’ve finished using BLOCKS to train a regression model, make predictions, and evaluate the results. We’ll explain what to do in case of errors in the following section.
In case of an error
If an error occurs during a training in the DataEditor, you can find the error logs by doing the following:
- Click the name of the model whose RMSE/Accuracy column shows Failed.
- Click the name of the training.
- Click Error logs.
- Click Copy error logs to clipboard.
If an error occurs on the Flow Designer, you can find the error logs in the logs panel. If you need to copy the error logs to contact BLOCKS Support, do the following:
- Select the logs with the status Failed.
- Click Show error log details.
- Click the button to copy the logs.
Error messages in the Flow Designer are shown in red, but it’s often helpful to read the logs before and after the red error message.
If you encounter an error that you cannot solve after several attempts, you can contact BLOCKS Support by clicking your user icon in the global navigation bar and selecting Contact Us. Copy the entire contents of your error logs, not just the red lines, and include them either as a text file or within your message. For errors in a Flow Designer, you should also export your flow as a JSON file and attach it to your message to BLOCKS Support.
Note: For more details on contacting BLOCKS Support, refer to the Basic Guide: Contact Us page.
Summary
With BLOCKS, training a regression model and making predictions doesn’t require specialized machine learning knowledge. All you need to do is prepare your data. However, there are a few points to keep in mind when doing this. In order to be usable by BLOCKS, your data should be a CSV file with UTF-8 encoding (without BOM).
While this tutorial doesn’t evaluate whether the trained model is accurate enough or not, it could certainly be improved by re-thinking the input features.
For example, there’s a strong possibility that electricity usage would be much lower on weekends and holidays when many companies are closed. You could try to improve on the model from this tutorial by adding features like the day of the week or whether or not it’s a holiday.
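As an illustration of this idea, here’s how such features could be added with pandas before uploading the data. The holiday list here is a placeholder; a real one could come from a public calendar or a holiday library:

```python
import pandas as pd

data = pd.read_csv("Electricity Demand - Initial Data.csv",
                   parse_dates=["key"])

# Day of the week as a number (Monday = 0 ... Sunday = 6).
data["day_of_week"] = data["key"].dt.dayofweek

# Placeholder holiday list; replace with a real calendar of holidays.
holidays = pd.to_datetime(["2016-01-01", "2016-01-11", "2016-02-11"])

# 1 on weekends and holidays (when many companies are closed), else 0.
data["is_closed_day"] = ((data["day_of_week"] >= 5)
                         | data["key"].isin(holidays)).astype(int)
```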
Note: As with this tutorial, using machine learning to solve real business problems starts with gathering data. You may be able to use data your company already has, or you may need to take steps to gather or purchase new data. You then need to examine your data and determine which features to use, cleanse it of irregular values, and prepare it in the correct format for training a model. It's not an exaggeration to say that collecting and processing data makes up the better part of doing machine learning.