Converting unstructured data into structured form
19 February 2024
To show how unstructured data is converted to a structured format, let's take a PDF document containing a table as an example. The goal is to convert that table into an Excel or CSV table.
Automation and large language models (LLMs) such as ChatGPT can spare data specialists much of the effort of learning programming languages, because such problems can be solved through plain-text queries.
So instead of "googling" to find a solution, we will ask ChatGPT to write the code that converts a PDF document to a table.
❏ Text request to ChatGPT:
Please write code to extract text from a PDF file that has a table in it. The code should take the file path as an argument and return the extracted table as a DataFrame. ⏎
➤ ChatGPT Answer:
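The generated code is not reproduced in this text. As a rough illustration of what such an answer might look like, here is a minimal sketch. It assumes the third-party pdfplumber library (the original answer may well have used a different one) and assumes that the first row of the table is its header. The names `extract_table_from_pdf`, `rows_to_dataframe` and `page_number` are illustrative only.

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Turn a list of rows (first row = header) into a DataFrame."""
    if not rows:
        # No table was found: return an empty DataFrame.
        return pd.DataFrame()
    # Replace missing cells (None) with empty strings and trim whitespace.
    cleaned = [[(cell or "").strip() for cell in row] for row in rows]
    header, *body = cleaned
    return pd.DataFrame(body, columns=header)


def extract_table_from_pdf(pdf_path, page_number=0):
    """Return the first table found on one page of a PDF as a DataFrame."""
    # Imported here because it is an optional third-party dependency
    # (assumption: installed with `pip install pdfplumber`).
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        rows = page.extract_table()  # list of rows, or None if no table
    return rows_to_dataframe(rows)
```

Calling `extract_table_from_pdf("report.pdf")` (a hypothetical file name) would return the table on the first page. This approach only works for PDFs that contain a text layer.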
This code can be run offline in one of the popular IDEs (integrated development environments): PyCharm, Visual Studio Code (VS Code), Jupyter Notebook, Spyder, Atom, Sublime Text, Eclipse with PyDev plugin, Thonny, Wing IDE, IntelliJ IDEA with Python plugin, JupyterLab, or in popular online tools: Kaggle.com, Google Colab, Microsoft Azure Notebooks, Amazon SageMaker.
❏ In the Transform step, we use the popular Pandas library to read the extracted text into a DataFrame and finally save the DataFrame to a CSV file:
I need code that will convert the resulting table from a PDF file to a DataFrame. Also add code to save the DataFrame to a CSV file. ⏎
➤ ChatGPT Answer:
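Again, the generated answer is not reproduced here. A minimal sketch of this step, under the assumption that the extracted table arrives as a list of rows whose first row is the header, could look as follows. The function name `table_rows_to_csv` and its parameters are hypothetical.

```python
import pandas as pd


def table_rows_to_csv(rows, csv_path):
    """Build a DataFrame from extracted rows and save it as a CSV file.

    `rows` is assumed to be a list of lists whose first row is the header.
    """
    header, *body = rows
    df = pd.DataFrame(body, columns=header)
    # Drop rows in which every cell is missing.
    df = df.dropna(how="all")
    # index=False keeps the DataFrame's numeric row index out of the file.
    df.to_csv(csv_path, index=False, encoding="utf-8")
    return df
```

For example, `table_rows_to_csv([["Item", "Quantity"], ["Brick", "1200"]], "table.csv")` would write a two-line CSV file with a header and one data row.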
Using plain-text queries in ChatGPT, a dozen lines of Python and the Pandas library, we converted a PDF document into the CSV table format, which, unlike PDF, is widely supported and easy to integrate into various data management systems. We can apply this code to tens or thousands of new PDF documents, thus automating the conversion of a stream of unstructured documents into a structured table format.
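Applying the conversion to many documents can be sketched with a simple loop over a folder. The helper below is an illustration, not part of the original answer: `convert_folder` and the `convert_one(pdf_path, csv_path)` callback are hypothetical names, and the callback stands for whatever single-file conversion is used. The loop processes files one after another; running them in parallel would need additional code.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, convert_one):
    """Apply `convert_one(pdf_path, csv_path)` to every PDF in `src_dir`.

    Returns a list of (pdf_path, error) pairs for files that failed.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    failures = []
    for pdf_path in sorted(src.glob("*.pdf")):
        # Each output file keeps the name of its source document.
        csv_path = dst / (pdf_path.stem + ".csv")
        try:
            convert_one(pdf_path, csv_path)
        except Exception as exc:
            # Keep going if one document is malformed; report it afterwards.
            failures.append((pdf_path, exc))
    return failures
```

Collecting failures instead of stopping at the first error lets a long batch finish and leaves a list of documents to inspect by hand.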
PDF documents do not always contain a text layer; often they are scanned documents that we need to process as images. Although images are inherently unstructured, recognition libraries allow us to extract, process and analyse their content, so that this data can also be used in business processes.