The Transform step is where the data is processed and transformed. This process may include correctness checking, normalization, filling in missing values and validation using automated tools
According to the PwC study “Data-Driven. What Students Need to Succeed in a Rapidly Changing Business World” (2015) (PWC, “Data driven What students need to succeed in a rapidly changing business world,” February 15, 2015), modern auditing companies are moving away from random data verification and are moving to analyzing massive amounts of information using automated tools. This approach allows not only to identify inconsistencies in reporting, but also to offer recommendations for optimizing business processes.
In construction, similar methods can be used, for example, for automatic validation of design data, construction quality control and contractor performance evaluation. One tool to automate and speed up data processing is the use of regular expressions (RegEx) in the data transformation phase (Transform) of the ETL process. RegEx allows you to efficiently validate data strings, detect inconsistencies and ensure the integrity of information with minimal resource consumption. We talked more about RegEx (Fig. 4.4-7) in the chapter “Translating Requirements into Structured Form”.
Let’s consider a practical example: in the real estate object management system (RPM) the manager sets the requirements to the key attributes of objects (Fig. 7.2-7). At the transformation stage, the following parameters need to be validated:
- checking of object identifier formats (attribute “ID”)
- control of replacement warranty period values (attribute “Warranty period”)
- verification of the element replacement cycle (attribute “Maintenance Requirements”)

To set boundary values for parameter validation, for example, suppose we know from our experience that valid values for the “ID” attribute can only include string values “W-NEW”, “W-OLD1” or “D-122” or similar values where the first character is a letter followed by a dash and then three alphabetic characters ‘NEW’, ‘OLD’ or any three-digit number (Fig. 7.2-7). The following regular expression (RegEx) can be used to validate these identifiers:
^W-NEW$|^W-OLD[0-9]+$|^D-1[0-9]{2}$
This template makes sure that all identifiers in the data match the specified criteria. If any value fails the check, the system logs an error. To create Python -code to transform the data and use the resulting data to create a results table, simply formulate a query in the LLM chat.
- Text request to LLM:
Write code to validate the columns of DataFrame using regular expressions, which checks identifiers in the format ‘W-NEW’ or ‘W-OLD’ via RegEx, energy efficiency with letters ‘A’ through ‘G’, warranty period and replacement cycle with numeric values in years ⏎
![]() |
The above, automatically generated Python -code (Fig. 7.2-8), uses the “re” library (RegEx regular expressions) to define a function that checks each attribute of a data item in a DataFrame. For each specified column (attribute), the function applies a RegEx pattern to check that each record conforms to the expected format and adds the results as new values (False/True) to the new attribute-column of the DataFrame.
Such automated validation ensures formal data compliance and can be used as part of a quality control system during the transformation phase.
After successful completion of the Transform step and quality check, the data is ready to be uploaded to target systems. Transformed and validated data can be uploaded to CSV, JSON, Excel, databases and other formats for further use. Depending on the task, the results can also be presented in reports, graphs or analytical dashboards.