One of the most common tasks in construction projects is to process specifications in PDF format. To demonstrate the transition from unstructured data to a structured format, let’s consider a practical example: extracting a table from a PDF document and converting it to CSV or Excel format (Fig. 4.1-2).

Large language models (LLMs) such as ChatGPT, LLaMA, Mistral, DeepSeek, Grok, Claude, and Qwen greatly simplify the way data scientists work with data, reducing the need for in-depth study of programming languages and allowing many tasks to be solved with text queries.
Therefore, instead of spending time searching for solutions on the Internet (usually on Stack Overflow or on thematic forums and chats) or contacting data processing specialists, we can use the capabilities of modern online or local LLMs. It is enough to submit a query, and the model will provide ready-made code for converting a PDF document into a tabular format.
- Send the following text request to any LLM (ChatGPT, LLaMA, Mistral, DeepSeek, Grok, Claude, Qwen, or any other):
Please write code to extract text from a PDF file that contains a table. The code should take the file path as an argument and return the extracted table as a DataFrame⏎
This code (Fig. 4.1-3) can be run offline in one of the popular IDEs we mentioned above (PyCharm, Visual Studio Code (VS Code), Jupyter Notebook, Spyder, Atom, Sublime Text, Eclipse with the PyDev plugin, Thonny, Wing IDE, IntelliJ IDEA with the Python plugin, JupyterLab) or in one of the popular online tools (Kaggle.com, Google Colab, Microsoft Azure Notebooks, Amazon SageMaker).
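A response to this request might look roughly like the following sketch. It assumes the `pdfplumber` library, which the text does not name (other libraries could be used instead), and the function names `extract_table_from_pdf` and `rows_to_dataframe` are hypothetical. It also assumes that the first row of the table is the header.

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Turn a list of rows (first row = header) into a DataFrame."""
    if not rows:
        # No table was found: return an empty DataFrame instead of failing.
        return pd.DataFrame()
    header, *body = rows
    return pd.DataFrame(body, columns=header)


def extract_table_from_pdf(pdf_path, page_number=0):
    """Return the first table found on the given page as a DataFrame."""
    # Imported here so the helper above stays usable without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # extract_table() returns a list of rows, or None if no table is found.
        rows = page.extract_table()
    return rows_to_dataframe(rows)
```

A call such as `extract_table_from_pdf("specification.pdf")` (a placeholder file name) would then return the table on the first page, provided the PDF contains real text rather than a scanned image.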
- In the “Convert” step, we use the popular Pandas library (which we discussed in detail in the chapter “Python Pandas: an indispensable tool for working with data”) to read the extracted text into a DataFrame and save the DataFrame to a CSV or XLSX table file:
I need code that will convert the resulting table from a PDF file to a DataFrame. Also add code to save the DataFrame to a CSV file ⏎
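As an illustration of this step, the sketch below builds a DataFrame from rows that have already been extracted and writes it to CSV. The function name `save_table` and the sample rows in the usage note are hypothetical, and the whitespace clean-up is an assumption about what extracted cells typically need.

```python
import pandas as pd


def save_table(rows, csv_path):
    """Build a DataFrame from extracted rows and write it to a CSV file."""
    # Strip stray whitespace that PDF layouts often leave around cell text.
    cleaned = [
        [cell.strip() if isinstance(cell, str) else cell for cell in row]
        for row in rows
    ]
    header, *body = cleaned
    df = pd.DataFrame(body, columns=header)
    df.to_csv(csv_path, index=False)
    # For Excel output, df.to_excel("table.xlsx", index=False) can be used
    # instead; it needs an Excel writer package such as openpyxl.
    return df
```

For example, `save_table([["Item", "Qty"], ["Concrete", "12"]], "table.csv")` writes a two-column CSV file with one data row.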
If an error occurs when executing the code (Fig. 4.1-3, Fig. 4.1-4), for example because of missing libraries or a wrong file path, the error text can simply be copied together with the source code and resubmitted to the LLM. The model will analyze the error message, explain what the problem is, and suggest fixes or additional steps.
Thus, interaction with the LLM becomes a complete cycle (request → response → test → feedback → correction) that does not require deep technical knowledge.
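The two failures mentioned above, a missing library and a wrong file path, can also be checked before the conversion is run. The following is a minimal sketch using only the Python standard library; the function name `find_problems` and the default list of required libraries are assumptions made for illustration.

```python
import importlib.util
from pathlib import Path


def find_problems(pdf_path, required=("pandas", "pdfplumber")):
    """Return human-readable descriptions of common setup problems."""
    problems = []
    for name in required:
        # find_spec() returns None when a top-level package cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(
                f"library '{name}' is not installed (try: pip install {name})"
            )
    if not Path(pdf_path).is_file():
        problems.append(f"file not found: {pdf_path}")
    return problems
```

An empty list means neither of these two problems was detected; any other error message can still be passed back to the LLM as described above.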
Using a plain text query in an LLM chat and a dozen lines of Python that we can run locally in any IDE, we converted a PDF document into a tabular CSV format, which, unlike a PDF document, is easily machine-readable and can be quickly integrated into any data management system.
By copying this code (Fig. 4.1-3, Fig. 4.1-4) from any LLM chat, we can apply it to tens or thousands of new PDF documents on the server, thereby automating the conversion of a stream of unstructured documents into a structured CSV table format.
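Applying the conversion to a whole folder can be sketched as follows. The name `convert_folder` and the `convert_one` callback are hypothetical; the callback stands for whatever single-file conversion function the LLM produced, taking a PDF path and a CSV path.

```python
from pathlib import Path


def convert_folder(pdf_dir, csv_dir, convert_one):
    """Apply convert_one(pdf_path, csv_path) to every PDF in pdf_dir."""
    pdf_dir, csv_dir = Path(pdf_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        csv_path = csv_dir / (pdf_path.stem + ".csv")
        try:
            convert_one(pdf_path, csv_path)
            converted.append(csv_path)
        except Exception as exc:
            # One unreadable document should not stop the whole batch.
            failed.append((pdf_path, str(exc)))
    return converted, failed
```

Collecting the failures in a separate list makes it easy to review the documents that could not be converted, for instance scanned PDFs without a text layer.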
However, PDF documents do not always contain text. More often than not, they are scanned documents that need to be processed as images. Although images are inherently unstructured, recognition libraries allow us to extract, process, and analyze their content, so that this data can be used fully in business processes.