Big Data: analyzing data from a million building permits in San Francisco
26 February 2024Predictions and forecasts based on historical data
26 February 2024In this chapter, we will and analyze a large dataset using data from various CAD (BIM) systems. A specialized automated web crawler, configured to automatically search and collect design files from websites offering free architectural models in Revit® and IFC formats, was used to create a large dataset. The crawlers successfully found and downloaded 4,596 IFC files and 6,471 Revit® files in a few days.
After collecting projects, in Revit® and IFC formats of different versions and after conversion using free reverse engineering SDKs, thousands of designs were collected into one big dataframe - the data form we talked about in the paragraph "DataFrame is a popular data structure".
The data from this large-scale collection contains the following details: the IFC files dataset contains approximately 4 million elements (rows) and 24,962 attributes (columns), and the Revit® files dataset of approximately 6 million elements (rows) contains 27,025 different attributes (columns).
These information sets span millions of elements, for each of which BoundingBox geometries were additionally derived (a rectangle that defines the boundaries of an object in the project) were additionally obtained and images of each element in PNG format and geometry in open XML - DAE (Collada) format were created.
In this way we got all information on tens of millions of elements from 4,596 IFC projects and 6,471 Revit® projects where all properties of all elements and their geometry (BoundingBox) were translated into a structured form of just one table (DataFrame).
Such huge amounts of structured data open up limitless possibilities for analysis and decision-making. An example of analyzing this dataset is the "5000 projects IFC and Revit®" project, available on Kaggle. This ready-made Jupyter Notebook project includes not only code of the solutions and but also a clear visualization of the analytical results.
Using the Pandas library (which we discussed in detail in the chapter "Pandas: An Indispensable Tool in Data Analysis"), a general analysis of all elements from all projects was performed.
For example, using project meta-information, it is possible to determine for which cities projects have been created by displaying this information on a map. From this dataset, as well, find out at what time of day, week, or month project files were saved or modified. Using BoundingBox geometry data, visualized all columns from all projects in one image in just a few seconds, and showed the dimensions of all windows from various projects from around the world in one image.
The Pipeline analytics code for this example and the dataset itself is available on the Kaggle website under the name "5000 projects IFC and Revit® | DataDrivenConstruction.io". This Pipeline together with the dataset can be copied and run in one of the popular IDEs: PyCharm, Visual Studio Code (VS Code), Jupyter Notebook, Spyder, Atom, Sublime Text, Eclipse with PyDev plugin, Thonny, Wing IDE, IntelliJ IDEA with Python plugin, JupyterLab or popular online tools Kaggle.com, Google Collab, Microsoft Azure Notebooks, Amazon SageMaker.
With this information analysis using data from past projects, professionals can effectively predict material and labour requirements and optimise design decisions before construction begins.
The analytics derived from the processing and exploration of massive amounts of structured data are crucial in the decision-making process in the construction industry.
But back to the realities of the construction industry: while self-driving cabs are already popping up in San Francisco, the companies working in the construction industry are still manual and often paper-based businesses where decisions are based more on intuition and experience than on data.
Recently, the construction sector in most countries has faced high central bank interest rates, cooling investor activity and falling margins to 1-2% of project costs. And with the introduction of new requirements to hand over CAD (BIM) models to clients or banks issuing construction loans, the opening up of volume data makes traditional speculation on volume and cost effectively unrealistic. All these financial difficulties, as well as the risk of bankruptcy and lack of liquidity, will push construction companies to incorporate automation, big data and analytics into their operations as the only tools to survive and improve efficiency in today's realities.
The availability of large amounts of open data and the realisation that predictive models can be created from it opens the door to the world of statistics, forecasting and machine learning.