
147 Questioning the feasibility of big data: correlation, statistics and data sampling

Traditionally, construction relied on subjective hypotheses and personal experience. Engineers assumed – with some degree of probability – how a material would behave, what loads a structure would withstand and how long a project would take. These assumptions were then tested in practice, often at the cost of time, resources and future risk.

With the advent of big data, the approach is changing dramatically: decisions are no longer made on the basis of intuitive guesses, but as a result of analyzing large-scale data sets. Construction is gradually ceasing to be an art of intuition and becoming a precise science of prediction.

The move to big data inevitably raises an important question: how critical is the volume of data, and how much information is really needed for reliable predictive analytics? The widespread belief that “the more data, the higher the accuracy” does not always hold up statistically in practice.

Back in 1934, the statistician Jerzy Neyman showed (J. Neyman, “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection,” Journal of the Royal Statistical Society, vol. 97, no. 4, 1934) that the key to the accuracy of statistical inference lies not so much in the amount of data as in the representativeness and randomness of the sample.

This is especially true in the construction industry, where large volumes of data are collected from IoT sensors, scanners, surveillance cameras, drones and multi-format CAD models, increasing the risk of blind spots, outliers and data distortion.

Let’s consider an example of road pavement condition monitoring. A complete data set covering all road segments may occupy X GB and take about a day to process. A sample that includes only every 50th road section occupies only X/50 GB and is processed in half an hour, while providing similar accuracy for many estimates (Fig. 9.1-1).

Fig. 9.1-1 Pavement condition histograms: the full data set and a random sample show identical results.
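To see why the sample works, here is a minimal sketch in Python; the pavement condition index (PCI) values are simulated, since the real monitoring data set is not reproduced here:

```python
import numpy as np
import pandas as pd

# Illustrative sketch: PCI values (0-100) for 1,000,000 road segments
# are simulated here, standing in for a real monitoring data set.
rng = np.random.default_rng(42)
full = pd.DataFrame(
    {"pci": rng.normal(loc=65, scale=12, size=1_000_000).clip(0, 100)}
)

# Keep every 50th road segment, mimicking the 1/50 sample from the text.
sample = full.iloc[::50]

# Summary statistics of the full data set and the sample barely differ.
for name, df in [("full data", full), ("1/50 sample", sample)]:
    print(f"{name:>12}: n={len(df):>9,}  mean={df.pci.mean():.2f}  std={df.pci.std():.2f}")

# Histogram shapes are also nearly identical (compare normalized bin heights).
bins = np.linspace(0, 100, 21)
h_full, _ = np.histogram(full.pci, bins=bins, density=True)
h_sample, _ = np.histogram(sample.pci, bins=bins, density=True)
print("largest histogram difference:", float(np.abs(h_full - h_sample).max()))
```

Strictly speaking, taking every 50th record is systematic rather than purely random sampling; as long as the road segments are not ordered with a matching periodic pattern, the two behave similarly.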

Thus, the key to successful data analysis often lies not in the volume of data but in the representativeness of the sample and the quality of the processing methods used. The move to random sampling and a more selective approach requires a shift in thinking in the construction industry. Historically, companies have followed the logic of “the more data the better,” believing that covering all possible indicators would maximize accuracy.

This approach is reminiscent of a popular misconception in project management: “the more specialists I bring in, the more effective the work will be.” As with staffing, quality and tools matter more than quantity. Without accounting for the interrelationships (correlations) between data points or project participants, increasing volume only adds noise, distortion, duplication and unnecessary waste.

In the end, it often proves far more productive to have a smaller but well-prepared data set capable of producing stable, well-founded forecasts than to rely on massive but chaotic information full of contradictory signals.

Excessive data volume not only fails to guarantee greater accuracy but can also lead to distorted conclusions due to noise, redundant features, hidden correlations and irrelevant information. Under such conditions the risk of overfitting models grows and the reliability of analytical results falls.

In construction, a major challenge in working with big data is determining the optimal quantity and quality of data. For example, when monitoring the condition of concrete structures, using thousands of sensors and collecting readings every minute can overwhelm the storage and analysis system. Yet if you run a correlation analysis and keep only the 10% most informative sensors, you can achieve almost identical prediction accuracy while spending a fraction of the resources, sometimes tens or hundreds of times less.
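A minimal sketch of such correlation-based selection, using simulated sensor readings (the sensor counts, the target quantity and the linear model are illustrative assumptions, not a prescription):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: 200 sensors on a concrete structure, of which only a
# handful actually carry signal about the target quantity (e.g., deflection).
rng = np.random.default_rng(0)
n_obs, n_sensors = 5_000, 200
X = pd.DataFrame(rng.normal(size=(n_obs, n_sensors)),
                 columns=[f"sensor_{i}" for i in range(n_sensors)])
target = X.iloc[:, :15].sum(axis=1) + rng.normal(scale=0.5, size=n_obs)

# Rank sensors by absolute correlation with the target and keep the top 10%.
corr = X.corrwith(target).abs().sort_values(ascending=False)
top = corr.head(n_sensors // 10).index

# Compare prediction quality with all sensors vs the selected subset.
X_tr, X_te, y_tr, y_te = train_test_split(X, target, random_state=0)
full_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
sub_r2 = LinearRegression().fit(X_tr[top], y_tr).score(X_te[top], y_te)
print(f"R² with all {n_sensors} sensors: {full_r2:.3f}")
print(f"R² with top {len(top)} sensors:  {sub_r2:.3f}")
```

In this simulation the 10% subset reaches virtually the same R² as the full sensor array while storing and processing a tenth of the data.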

Using a smaller subset of data reduces both the storage required and the processing time, which significantly lowers the cost of storing and analyzing data. This often makes random sampling an ideal solution for predictive analytics, especially in large infrastructure projects or real-time applications. Ultimately, the efficiency of construction processes is determined not by the amount of data collected but by the quality of its analysis. Without a critical approach and careful analysis, data can lead to incorrect conclusions.

After a certain amount of data, each new unit of information yields less and less useful insight. Instead of endlessly collecting information, it is important to focus on its representativeness and on the methods of analysis (Fig. 9.2-2).
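One way to make this diminishing return concrete is the standard error of a mean, which shrinks only as σ/√n: quadrupling the data merely halves the uncertainty. A small simulation (with illustrative numbers) shows the effect:

```python
import numpy as np

# The spread of an estimated mean falls as 1/sqrt(n): each tenfold increase
# in data buys only about a 3x improvement in precision.
rng = np.random.default_rng(1)
population = rng.normal(loc=65, scale=12, size=1_000_000)

for n in [100, 1_000, 10_000, 100_000]:
    # Repeat the estimate 200 times to measure how much it fluctuates.
    means = [rng.choice(population, size=n).mean() for _ in range(200)]
    print(f"n={n:>7,}: spread of the estimated mean ≈ {np.std(means):.3f}")
```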

This phenomenon is well described by Allen Wallis (J. Perla, T. J. Sargent and J. Stachurski, “A Problem that Stumped Milton Friedman,” Quantitative Economics with Julia, 1 Jan. 2025), who illustrated the use of statistical methods with the testing of two alternative U.S. Navy projectile designs.

The Navy tested two alternative projectile designs (A and B) in a series of paired firings. In each round, A scores 1 or 0 depending on whether it performs better or worse than B, and vice versa. The standard statistical approach is to run a fixed number of trials (say, 1,000) and declare a winner from the percentage distribution (for example, if A scores 1 more than 53% of the time, it is judged the better design). When Allen Wallis discussed such a problem with (Navy) Captain Garrett L. Schuyler, the captain objected that such a test, to quote Wallis’s account, might be wasteful. If a wise and experienced ordnance officer such as Schuyler had been on the spot, he would have seen after the first few hundred [shots] that the experiment need not be completed, either because the new method is clearly inferior or because it is clearly superior to what was hoped for.

– U.S. Government Statistical Research Group at Columbia University, World War II period
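Schuyler’s objection eventually led, via Wallis and Milton Friedman, to Abraham Wald’s sequential analysis. The sketch below shows the core idea of a sequential probability ratio test in Python; the win probabilities and error rates are illustrative assumptions, not values from the Navy study:

```python
import numpy as np

# A minimal sketch of Wald's sequential probability ratio test (SPRT):
# stop firing as soon as the accumulated evidence is strong enough.
p0, p1 = 0.50, 0.58          # H0: designs equal vs H1: design A better (assumed)
alpha, beta = 0.05, 0.05     # tolerated false-positive / false-negative rates
upper = np.log((1 - beta) / alpha)   # accept H1 once the log-likelihood exceeds this
lower = np.log(beta / (1 - alpha))   # accept H0 once it drops below this

rng = np.random.default_rng(7)
llr, rounds = 0.0, 0
while lower < llr < upper:
    a_wins = rng.random() < 0.58     # simulate one paired firing (A truly better)
    llr += np.log(p1 / p0) if a_wins else np.log((1 - p1) / (1 - p0))
    rounds += 1

verdict = "A is better" if llr >= upper else "no real difference"
print(f"stopped after {rounds} rounds: {verdict}")
```

With these numbers the test typically stops after a couple of hundred simulated rounds rather than a fixed thousand, the behavior Schuyler expected of an experienced observer.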

This principle is widely used in other industries. In medicine, clinical trials of new drugs are run on random samples of patients, which yields statistically significant results without testing the drug on the entire population of the planet. In economics and sociology, representative surveys capture the opinion of society without interviewing every person in the country.

Just as governments and research organizations conduct surveys of small populations to understand general social trends, companies in the construction industry can use random data samples to effectively monitor and create forecasts for project management (Fig. 9.1-1).

Big data may change the approach to social science, but it will not replace statistical common sense.

– Thomas Lansdall-Welfare, “Predicting the nation’s current mood,” Significance, vol. 9, no. 4, 2012

From a resource-saving perspective, an important question when collecting data for prediction and decision-making is: does it make sense to spend significant resources collecting and processing huge data sets when a much smaller and cheaper test data set, scaled up incrementally, may suffice? The effectiveness of random sampling shows that companies can cut the cost of data collection and model training by orders of magnitude by choosing collection methods that do not require exhaustive coverage yet still provide sufficient accuracy and representativeness. This approach allows even small companies to achieve results on par with large corporations while using far fewer resources, which matters for any organization looking to optimize costs and accelerate informed decision-making. The following chapters explore examples of analytics and predictive analytics built on public datasets using big data tools.
