One of the most famous examples of using ML in data analytics is the analysis of the Titanic dataset, which is often used to study the probability of survival of passengers. Studying this table is analogous to the “Hello World” program when learning programming languages.
The sinking of the RMS Titanic in 1912 resulted in the deaths of 1502 out of 2224 people. The Titanic dataset contains not only information about whether a passenger survived, but also attributes such as: age, gender, ticket class and other parameters. This dataset is available for free and can be opened and analyzed on various offline and online platforms.
Link to Titanic dataset:
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
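As a minimal sketch, the table at the link above can be read into a pandas DataFrame. The helper name `load_titanic` and the three sample rows are assumptions made for illustration; the rows are invented and are not real passenger records. They only mimic the column layout so the example also runs without network access.

```python
import io

import pandas as pd

TITANIC_URL = (
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)


def load_titanic(source=TITANIC_URL):
    """Read the Titanic table from a URL, a file path, or a file-like object."""
    return pd.read_csv(source)


# Invented rows in the same column layout (hypothetical, for offline use only).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,0,0,T-0001,8.05,,S
2,1,1,"Roe, Mrs. Jane",female,41,1,0,T-0002,80.0,B12,C
3,1,2,"Poe, Miss. Ann",female,,0,1,T-0003,13.0,,Q
"""

sample = load_titanic(io.StringIO(SAMPLE_CSV))
print(sample.shape)   # (3, 12): three rows, twelve columns
print(sample.head())

# With network access, the full table can be read the same way:
# titanic = load_titanic()
```

The same call works in a Kaggle or Google Colab notebook, where pandas is already installed.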
Earlier, in the chapter “LLM-enabled IDEs and future programming changes”, we discussed Jupyter Notebook, one of the most popular development environments for data analysis and machine learning. Free cloud-based counterparts to Jupyter Notebook are the Kaggle and Google Colab platforms, which allow you to run Python code without installing software and provide free access to computing resources.
Kaggle is the largest platform for data analytics and machine learning competitions and has a built-in code execution environment. As of October 2023, Kaggle had over 15 million users from 194 countries (Wikipedia, “Kaggle,” 1 January 2025).
Open the Titanic dataset on the Kaggle platform (Fig. 9.2-5): there you can store a copy of the dataset and run Python code with pre-installed libraries directly in the browser, without installing a dedicated IDE.

The Titanic dataset describes passengers who were on board the RMS Titanic when it sank in 1912. It covers only part of the 2224 people on board: the dataset is presented as two separate tables, a training sample (train.csv, 891 passengers) and a test sample (test.csv, 418 passengers), so it can be used both for training models and for evaluating their accuracy on new data.
The training dataset contains both passenger attributes (age, gender, ticket class and others) and information about who survived (the binary “Survived” column). The training dataset (Fig. 9.2-6, file train.csv) is used to train the model. The test dataset (Fig. 9.2-7, file test.csv) includes only passenger attributes, without the “Survived” column. It is designed to test the model on new data and to evaluate its accuracy.
Thus, the training and test datasets contain almost identical passenger attributes. The only key difference is that the test dataset lacks the “Survived” column, the target variable that we want to learn to predict using various mathematical algorithms. After building the model, we can evaluate it by comparing its predictions for the test passengers with their actual “Survived” values, which are not part of test.csv.
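The roles of the two tables can be sketched on miniature, invented data. Everything named here (`TARGET`, `rule_predict`, `accuracy`, `held_back_truth`) is hypothetical, and the hand-written rule is only a placeholder standing in for a trained algorithm, not the method used later in the text.

```python
import pandas as pd

# Invented miniature tables in the train.csv / test.csv layout.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})
test = pd.DataFrame({
    "PassengerId": [5, 6],
    "Pclass": [2, 3],
    "Sex": ["female", "male"],
})

TARGET = "Survived"

# The target is the only column present in train but absent from test.
only_in_train = set(train.columns) - set(test.columns)
print(only_in_train)

# Features and target are separated before any model is trained.
X_train = train.drop(columns=[TARGET])
y_train = train[TARGET]


def rule_predict(features):
    """Placeholder 'model': predicts 1 for female passengers, 0 otherwise."""
    return (features["Sex"] == "female").astype(int)


def accuracy(predicted, actual):
    """Share of passengers whose predicted label matches the actual one."""
    predicted = pd.Series(list(predicted))
    actual = pd.Series(list(actual))
    return float((predicted == actual).mean())


test_predictions = rule_predict(test)
held_back_truth = [1, 0]  # invented labels, stored apart from the test table
print(accuracy(test_predictions, held_back_truth))
```

Swapping `rule_predict` for a trained model leaves the evaluation step unchanged: predictions for the test passengers are compared with labels the model never saw.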
The main columns of the table (passenger parameters) in the training and test datasets:
- PassengerId – unique passenger identifier
- Survived – 1 if the passenger survived, 0 if not (not available in the test set)
- Pclass – ticket class (1, 2 or 3)
- Name – passenger’s name
- Sex – sex of the passenger (male/female)
- Age – age of the passenger in years
- SibSp – number of siblings or spouses on board
- Parch – number of parents or children on board
- Ticket – ticket number
- Fare – ticket price
- Cabin – cabin number (many data are missing)
- Embarked – port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
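The column list above can double as a quick sanity check when a table is first opened. The sketch below is an assumption about how such a check might look; the helper names `missing_columns` and `unexpected_values` are hypothetical, and the small frame is invented, with one deliberately invalid ticket class.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]

# Allowed codes for the categorical columns described in the list.
ALLOWED_VALUES = {
    "Survived": {0, 1},
    "Pclass": {1, 2, 3},
    "Sex": {"male", "female"},
    "Embarked": {"C", "Q", "S"},
}


def missing_columns(df, has_target=True):
    """List expected columns that are absent (test tables have no target)."""
    expected = [c for c in EXPECTED_COLUMNS if has_target or c != "Survived"]
    return [c for c in expected if c not in df.columns]


def unexpected_values(df):
    """For each coded column, list values outside the documented codes."""
    report = {}
    for column, allowed in ALLOWED_VALUES.items():
        if column in df.columns:
            present = set(df[column].dropna().tolist())  # ignore missing cells
            report[column] = sorted(present - allowed)
    return report


# Invented sample; Pclass 4 is deliberately invalid.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 4],
    "Sex": ["male", "female", "female"],
    "Age": [30.0, 41.0, None],
    "Embarked": ["S", "C", None],
})

print(missing_columns(sample))
print(unexpected_values(sample))
```

Missing cells are skipped on purpose here; they are a separate question from invalid codes.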
To visualize missing data in both tables, you can use the missingno library (Fig. 9.2-6, Fig. 9.2-7), which displays missing values in the form of a matrix, where white fields show missing data. This visualization allows a quick assessment of data quality before processing.
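A rough sketch of this check is shown below. The numeric summary uses plain pandas, while the plot relies on `missingno.matrix`, which needs the third-party missingno package. The helper names and the sample frame are invented for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Count missing cells per column and keep only columns that have gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": (counts / len(df)).round(3),
    })
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


def plot_missing_matrix(df):
    """Draw the nullity matrix; white gaps mark missing cells."""
    import matplotlib.pyplot as plt
    import missingno as msno  # third-party: pip install missingno

    msno.matrix(df)
    plt.show()


# Invented sample with gaps in Age and Cabin.
sample = pd.DataFrame({
    "Age": [30.0, None, 41.0, 8.0],
    "Cabin": [None, "B12", None, None],
    "Embarked": ["S", "C", "Q", "S"],
})

summary = missing_summary(sample)
print(summary)

# In a notebook with missingno installed:
# plot_missing_matrix(sample)
```

The table gives exact counts, while the matrix shows at a glance whether gaps cluster in particular rows or columns.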


Before formulating hypotheses and making predictions based on the dataset, visual analysis helps to identify key patterns in the data, assess its quality, and spot possible dependencies. Many visualization techniques help you better understand the Titanic dataset: distribution plots for passenger age groups, survival charts by gender and class, and missing data matrices for assessing the quality of the information.
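The plots mentioned above might be produced along the following lines. This is a sketch under stated assumptions, not the code returned by any particular LLM: the helper names `survival_rate_by` and `plot_overview` are hypothetical, the sample rows are invented, and plotting assumes matplotlib is installed.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Mean of the 0/1 'Survived' column per group, i.e. the survival rate."""
    return df.groupby(column)["Survived"].mean()


def plot_overview(df):
    """Age histogram plus survival-rate bar charts by sex and ticket class."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
    df["Age"].dropna().plot.hist(bins=20, ax=axes[0], title="Age distribution")
    survival_rate_by(df, "Sex").plot.bar(ax=axes[1], title="Survival rate by sex")
    survival_rate_by(df, "Pclass").plot.bar(ax=axes[2], title="Survival rate by class")
    fig.tight_layout()
    plt.show()


# Invented rows in the Titanic column layout.
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 2, 3, 3, 1, 3],
    "Age": [29.0, 35.0, 22.0, None, 54.0, 4.0],
    "Survived": [1, 1, 0, 0, 1, 0],
})

by_sex = survival_rate_by(sample, "Sex")
by_class = survival_rate_by(sample, "Pclass")
print(by_sex)
print(by_class)

# In a notebook with matplotlib installed:
# plot_overview(sample)
```

Computing the grouped rates separately from the plotting keeps the numbers available for inspection even when no chart is drawn.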
- Let’s ask an LLM to help us visualize the data from the Titanic dataset by sending the following text request to any model (ChatGPT, LLaMA, Mistral, DeepSeek, Grok, Claude, Qwen or any other):
Please show some simple graphs for the Titanic dataset. Download the dataset yourself and show the ⏎
- LLM response in the form of ready-made code and graphs visualizing the dataset parameters:


Data visualization is an important step in preparing the dataset for the subsequent construction of a machine learning model, which can only be approached with an understanding of the data.