Converting unstructured data into structured form

ChatGPT and LLM to automate data processes

18 February 2024

Example of converting a PDF document to a table

19 February 2024

In construction projects, the main sources of unstructured data are technical documents, progress reports, plans, drawings and schematics. These data matter because of their content, which may include specifications, schedules, approval protocols and more. The variety of formats and structures makes the data difficult to organize, analyze and use in other systems and applications.

The conversion process varies with the type of input data and the desired output.
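
As a minimal sketch of how the route can depend on the input type, the snippet below picks a processing path from the file extension. The `ROUTES` mapping, the route descriptions and the `choose_route` helper are hypothetical and only illustrate the idea; a real project would define its own rules.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a processing route.
# The extensions and route names are assumptions for illustration only.
ROUTES = {
    ".pdf": "text extraction or OCR",
    ".png": "OCR",
    ".jpg": "OCR",
}


def choose_route(filename: str) -> str:
    """Return the processing route for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    # Anything not covered by the mapping is left for a person to inspect.
    return ROUTES.get(suffix, "manual review")


if __name__ == "__main__":
    for name in ["report.pdf", "site_photo.JPG", "notes.txt"]:
        print(name, "->", choose_route(name))
```

Keeping the rules in a plain dictionary makes it easy to add a new input type without touching the rest of the pipeline.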

Moving from unstructured to structured form is both an art and a science. It can take up much of a data engineer's or analyst's working time to produce a clean, organized data set that reflects the intricacies of the source data.

One of the most common tasks of a data engineer is converting data from multiple formats into structured form.

Converting unstructured data into a structured format is a step-by-step process:

Data extraction (Extract stage): This step loads the source document or image that contains the unstructured data. This can be a PDF document, a photo, a drawing or a schematic.
Data transformation (Transform stage): The unstructured data is then transformed into a structured format. For example, this may involve recognizing and interpreting text in images using Optical Character Recognition (OCR) or other processing techniques.
Data loading and saving (Load stage): The last stage saves the processed data in a format such as CSV, XLSX, XML or JSON for further work. The choice of format depends on the specific requirements and preferences of the project.
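
The three stages above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a definitive implementation: the sample text, the "Item: ...; Quantity: ...; Status: ..." line layout, and the function names `extract`, `transform`, `load_csv` and `load_json` are all hypothetical. In a real project the Extract stage would read a PDF or run OCR on an image instead of using an inline string.

```python
import csv
import io
import json
import re

# Hypothetical sample: text as it might look after a PDF or OCR step.
RAW_TEXT = """\
Progress report, week 12
Item: Concrete slab C30/37; Quantity: 120.5 m3; Status: approved
Item: Rebar B500B; Quantity: 14.2 t; Status: pending
Note: crane inspection scheduled for Friday
"""

# Pattern for the assumed "Item: ...; Quantity: ...; Status: ..." layout.
LINE_PATTERN = re.compile(
    r"Item:\s*(?P<item>[^;]+);\s*"
    r"Quantity:\s*(?P<quantity>[\d.]+)\s*(?P<unit>\w+);\s*"
    r"Status:\s*(?P<status>\w+)"
)

FIELDS = ["item", "quantity", "unit", "status"]


def extract(source: str) -> list[str]:
    """Extract stage: return the non-empty lines of the raw text."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def transform(lines: list[str]) -> list[dict]:
    """Transform stage: turn matching lines into typed records."""
    records = []
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match is None:
            continue  # free-text lines stay out of the table
        record = match.groupdict()
        record["item"] = record["item"].strip()
        record["quantity"] = float(record["quantity"])
        records.append(record)
    return records


def load_csv(records: list[dict]) -> str:
    """Load stage: serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def load_json(records: list[dict]) -> str:
    """Load stage: serialize the same records as JSON text."""
    return json.dumps(records, indent=2)


records = transform(extract(RAW_TEXT))
csv_text = load_csv(records)
json_text = load_json(records)

if __name__ == "__main__":
    print(csv_text)
    print(json_text)
```

Keeping each stage in its own function means the input source or the output format can be swapped without touching the other stages, for example by adding another `load_*` function for XML.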

This three-stage conversion (Extract, Transform, Load) is called ETL; we discuss it in more detail in the chapter "ETL and Pipeline: Extract, Transform, Load". Let's see in practice how documents of different formats are converted into structured form.

datadrivenconstruction.io
