Within construction projects, the main sources of unstructured data are technical documents, progress reports, plans, drawings and schematics. The importance of this data lies in its content, which may include specifications, schedules, approval protocols, and more. The variety of formats and structures makes this data difficult to organize, analyze, and use in other systems and applications.
The conversion process can vary depending on the type of input data and the desired processing results.
Moving from unstructured to structured form is both an art and a science, and it can take up a large share of a data engineer's or analyst's working time to produce a clean, organized data set that still reflects the intricacies of the source data.
Converting unstructured data into a structured format is a step-by-step process with three stages:
- Data extraction (Extract stage): This step loads a source document or image that contains unstructured data, such as a PDF document, a photo, a drawing or a schematic.
- Data transformation (Transform stage): This step transforms the unstructured data into a structured format. For example, it may involve recognizing and interpreting text in images using Optical Character Recognition (OCR) or other processing techniques.
- Data loading and saving (Load stage): The last stage saves the processed data in a format such as CSV, XLSX, XML or JSON for further work. The choice of format depends on the specific requirements and preferences of the project.
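The three stages can be sketched as three small functions. This is only an illustration under simplifying assumptions: the source is a plain text file with "key: value" lines, and the function names `extract`, `transform` and `load` are hypothetical, chosen to mirror the stage names.

```python
import csv
from pathlib import Path


def extract(path):
    """Extract: load the raw text of a source document."""
    return Path(path).read_text(encoding="utf-8")


def transform(raw_text):
    """Transform: turn 'key: value' lines into one structured record."""
    record = {}
    for line in raw_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            record[key.strip()] = value.strip()
    return [record]


def load(records, csv_path):
    """Load: save the structured records as a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

A full run would chain the stages, for example `load(transform(extract("report.txt")), "report.csv")`, where both file names are placeholders.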
This sequence of Extract, Transform and Load stages is called ETL; we discuss it in more detail in the chapter "ETL and Pipeline: Extract, Transform, Load". Let's see in practice how documents of different formats are converted into structured formats.
To illustrate the conversion, let's take a PDF document containing a table and convert that table to an Excel or CSV file.
Automation and large language models (LLMs), such as ChatGPT, can relieve data scientists of much of the need to learn programming languages in depth, because such problems can be solved through text queries.
So instead of "googling" for a solution, we will ask ChatGPT to write the code that converts a table in a PDF document into a structured table.
❏ Text request to ChatGPT:
Please write code to extract text from a PDF file that has a table in it. The code should take the file path as an argument and return the extracted table as a DataFrame. ⏎
➤ ChatGPT Answer:
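The exact answer depends on the model and the session. As a minimal sketch of what such an answer might look like, assuming the third-party pdfplumber library is used for table detection (a choice made here for illustration) and that the first row of the table is its header; the function names `rows_to_dataframe` and `extract_table_from_pdf` are hypothetical:

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Turn a list of rows (first row = header) into a DataFrame."""
    if not rows:
        return pd.DataFrame()  # no table detected on the page
    header, *body = rows
    return pd.DataFrame(body, columns=header)


def extract_table_from_pdf(pdf_path, page_number=0):
    """Return the first table found on the given page as a DataFrame."""
    # Third-party dependency, imported here so that the helper above
    # can be used even when pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        rows = page.extract_table()  # list of rows, or None if nothing is found
    return rows_to_dataframe(rows)
```

A call such as `df = extract_table_from_pdf("report.pdf")` (the file name is a placeholder) would then return the table as a DataFrame. Scanned PDFs without a text layer will usually yield an empty result with this approach.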
This code can be run offline in one of the popular IDEs (integrated development environments), such as PyCharm, Visual Studio Code (VS Code), Jupyter Notebook, Spyder, Atom, Sublime Text, Eclipse with the PyDev plugin, Thonny, Wing IDE, IntelliJ IDEA with the Python plugin or JupyterLab, or in popular online tools such as Kaggle.com, Google Colab, Microsoft Azure Notebooks or Amazon SageMaker.
❏ In the Transform step, we use the popular Pandas library to read the extracted table into a DataFrame and then save the DataFrame to a CSV file:
I need code that will convert the resulting table from a PDF file to a DataFrame. Also add code to save the DataFrame to a CSV file. ⏎
➤ ChatGPT Answer:
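Again, the actual answer will vary. A minimal sketch of the Transform and Load steps, assuming the extracted table arrives as a list of rows whose first row is the header (the function name `table_to_csv` is hypothetical):

```python
import pandas as pd


def table_to_csv(rows, csv_path):
    """Build a DataFrame from extracted rows and save it as a CSV file."""
    if not rows:
        raise ValueError("no table rows to convert")
    header, *body = rows
    # Extracted cells are strings (or None for empty cells) and often
    # carry stray whitespace, so normalize them before building the table.
    clean = [[(cell or "").strip() for cell in row] for row in body]
    df = pd.DataFrame(clean, columns=header)
    df.to_csv(csv_path, index=False)  # index=False omits the row numbers
    return df
```

The returned DataFrame can be inspected or processed further, while the CSV file is ready for import into other systems.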
Using plain text queries in ChatGPT and a dozen lines of Python with the Pandas library, we converted a table from a PDF document into CSV, a format that, unlike PDF, is widely supported and easy to integrate into various data management systems. The same code can be applied in a loop to tens or thousands of new PDF documents, automating the conversion of a stream of unstructured documents into a structured table format.
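Applying the conversion to many documents can be sketched as a loop over a folder. The function name `convert_folder` and the `convert_one` callback are hypothetical; `convert_one` stands for any function that turns one PDF file into one CSV file.

```python
from pathlib import Path


def convert_folder(input_dir, output_dir, convert_one):
    """Call convert_one(pdf_path, csv_path) for every PDF in input_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        csv_path = output_dir / (pdf_path.stem + ".csv")
        try:
            convert_one(pdf_path, csv_path)
            converted.append(csv_path)
        except Exception as exc:
            # One unreadable document should not stop the whole batch
            failed.append((pdf_path, exc))
    return converted, failed
```

The files are processed one after another here; collecting failures separately makes it easy to review the documents that need manual attention.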
PDF documents do not always contain a text layer; often they are scanned documents that have to be processed as images. Although images are inherently unstructured, recognition libraries allow us to extract, process and analyze their content, so that this data can be used fully in business processes.
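As a rough sketch of how a scanned page might be processed, assuming the third-party pytesseract and Pillow libraries together with an installed Tesseract OCR engine (all of these are assumptions, not tools named in the text), and assuming that table columns in the recognized text are separated by two or more spaces; the function names are hypothetical:

```python
import re


def ocr_image_to_text(image_path, lang="eng"):
    """Recognize the text in a scanned image using Tesseract OCR."""
    # Third-party dependencies, imported here so that text_to_rows
    # below can be used without them.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def text_to_rows(text):
    """Split recognized text into rows and columns (2+ spaces = column gap)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(re.split(r"\s{2,}", line))
    return rows
```

Real scans rarely split this cleanly, so the column rule above is a simplification that would need adjusting for each type of document.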