Predictions and forecasts based on historical data

Example of Big Data based on CAD (BIM) data

26 February 2024

Key concepts of machine learning

27 February 2024

Predictions and forecasts based on historical data

Collected data on the company's projects allows building models that predict the cost and time attributes of future, not yet existing projects without labor-intensive calculations.

Manual and semi-automated calculation of prices and timing attributes of the project in the future will be supplemented by the opinion and calculations of machine learning models. New data will be generated from existing information, similar to how ChatGPT creates new text, images and code based on existing data collected over years from all over the internet.

Just as we today use our experience and knowledge to predict future events, the future of construction projects will be shaped by a combination of old knowledge and the mathematical algorithms of machine learning models.

Machine learning is a class of methods for solving artificial intelligence problems. Machine learning algorithms recognize patterns in large data sets and use them for self-learning. Each new data set allows mathematical algorithms to improve and adapt according to the information obtained, which allows to constantly improve the accuracy of recommendations and predictions.

Qualitative and structured historical company data is the material from which predictions are made

As an example, let's take the problem of predicting the price of a house based on its size, land area, number of rooms and geographical location (latitude and longitude).

One way to solve the problem of predicting house price is to develop a classical algorithm that analyzes these four characteristics and calculates the expected price based on a specific rule or formula. Such a method requires precise knowledge of the formula for calculating house price, which is often impossible in practice.

To estimate the value of a house by size, plot area, number of rooms and geolocation, we can use a classical algorithm with a fixed formula

Alternatively, let us create a machine learning algorithm that generates a model based on prior understanding of the problem and historical data, which may be incomplete.

In the pricing example, machine learning is used to develop various mathematical algorithms and models based on incomplete knowledge of the price formation mechanism. The data-driven model adjusts and adapts itself to the cost and construction time data of similar projects.

Unlike classical formula-based estimation, the machine learning algorithm is trained on historical data

In the context of supervised machine learning, each project in the training dataset includes both input attributes (e.g., cost and time data of similar buildings) and expected output values (e.g., price or time). A similar dataset is used to create and tune the machine learning model. The larger the dataset and the better the quality of the data in the dataset, the more accurate the model will be and the more accurate the result of the predictions.

A machine learning model trained on hundreds of past project cost and schedule data will determine with some probability the price and schedule of a new project

Once the model has been created and trained, to estimate the cost or construction duration of a new project, it will be sufficient to provide the model with new attributes for the new project, and the model will provide estimated results, with some probability, based on the patterns learned previously.

Linear regression is a fundamental algorithm in data analysis, predicting the value of a variable based on the linear relationship with one or more other variables. This model assumes that there is a direct, linear connection between the dependent variable and one or more independent variables, and the goal of the algorithm is to find this relationship.

The simplicity and straightforwardness of linear regression have made it a staple in various fields for many decades. When dealing with a single variable, linear regression involves finding the best-fitting line through the data points.

This line is represented by an equation, where inputting a value for the independent variable (X) yields a predicted value for the dependent variable (Y). This process effectively allows for the prediction of Y based on known values of X, leveraging the linear dependency between them. An example of finding this statistically average line can be seen in the example of estimating San Francisco building permit data in Figure 4.4-15, where inflation was calculated for different types of facilities.

Let's load a picture table of project data (Figure 4.5-10 from the previous chapter) directly into ChatGPT and ask it to build a simple machine learning model for us.

❏ Text request to ChatGPT:

Need to show the construction of a primitive machine learning model for predicting the cost and time of a new project X (attached image)⏎

➤ ChatGPT Answer:

ChatGPT used simple linear regression to create a primitive machine learning model to predict the cost and construction time of a new construction project

ChatGPT identified the table in the attached image and converted the data from a visual format to an table array. This array was then used as the basis for creating a machine learning model.

Using basic linear regression model, which was trained on the "extremely small" dataset provided, predictions were made for a hypothetical new construction project, labelled Project X. In our task, this project is characterised by having 4 flats, 4 floors and a complexity level of 7 (Figure 4.5-10). According to the predictions made by a linear regression model based on a limited data set (Figure 4.5-11):

The construction time is estimated to be approximately 238 days (238.4444444)
The total cost is projected to be approximately $3,042,338 (3042337.777)

To further explore the project cost hypothesis, it is useful to experiment with different machine learning algorithms and techniques. Therefore, let's predict the same cost and time values from this small dataset using the K-Nearest Neighbours (k-NN) algorithm.