ChatGPT and LLM to automate data processes
18 February 2024
Example of converting a PDF document to a table
19 February 2024
Within construction projects, the main sources of unstructured data are technical documents, progress reports, plans, drawings and schematics. This data matters because of its content, which may include specifications, schedules, approval protocols, and more. The variety of formats and structures makes the data difficult to organize, analyze, and use in other systems and applications.
The conversion process can vary depending on the type of input data and the desired processing results.
Moving from unstructured to structured form is both an art and a science. It can take up much of a data engineer's or analyst's time to produce a clean, organized data set that still reflects the intricacies of the original data.
Converting unstructured data into a structured format is a step-by-step process with the following stages:
- Data extraction (Extract stage): This step loads a source document or image that contains unstructured data. This can be a PDF document, a photo, a drawing or a schematic.
- Data transformation (Transform stage): The unstructured data is then transformed into a structured format. For example, this may involve recognizing and interpreting text from images using Optical Character Recognition (OCR) or other processing techniques.
- Data loading and saving (Load stage): The last stage saves the processed data in a format such as CSV, XLSX, XML, or JSON for further work. The choice of format depends on the specific requirements and preferences of the project.
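The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the sample text, the `Task | Status | Due` line layout, and the function names `extract`, `transform`, `load_csv` and `load_json` are all assumptions made for the example. It also assumes the text has already been pulled out of the PDF or image, so no OCR or PDF library is involved.

```python
import csv
import io
import json
import re

# Hypothetical sample: text as it might look after a PDF or OCR step.
RAW_TEXT = """\
Progress report, week 12
Task: Foundation pouring | Status: done | Due: 2024-02-10
Task: Wall framing | Status: in progress | Due: 2024-03-01
Approved by site manager
"""

# Assumed line layout: "Task: <name> | Status: <status> | Due: <YYYY-MM-DD>"
LINE_PATTERN = re.compile(
    r"Task:\s*(?P<task>.+?)\s*\|\s*"
    r"Status:\s*(?P<status>.+?)\s*\|\s*"
    r"Due:\s*(?P<due>\d{4}-\d{2}-\d{2})"
)


def extract(source: str) -> list[str]:
    """Extract stage: take the raw content and split it into lines."""
    return source.splitlines()


def transform(lines: list[str]) -> list[dict[str, str]]:
    """Transform stage: keep only lines that match the assumed layout
    and turn each of them into a record with named fields."""
    records = []
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:
            records.append(match.groupdict())
    return records


def load_csv(records: list[dict[str, str]]) -> str:
    """Load stage (CSV): serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["task", "status", "due"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def load_json(records: list[dict[str, str]]) -> str:
    """Load stage (JSON): serialize the same records as JSON text."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    records = transform(extract(RAW_TEXT))
    print(load_csv(records))
    print(load_json(records))
```

Lines that do not match the assumed layout, such as the report title, are simply skipped. In a real project the Extract stage would read an actual file, and the Load stage would write to disk or a database rather than return a string.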
This sequence of Extract, Transform and Load stages is called ETL. We discuss it in more detail in the chapter "ETL and Pipeline: Extract, Transform, Load". Let's see in practice how documents of different formats are converted into structured formats.