To work through the classification tasks in this tutorial, you can run Jupyter notebooks in Google Colab, Anaconda, or VS Code. In Colab, create a Python 3 notebook. With Anaconda, download the distribution, then create and activate a Python environment. In VS Code, install the Python extension and use the integrated terminal. Install packages such as scikit-learn, pandas, and numpy in advance, then write your code in a Jupyter notebook.
● Logistic Regression
To prepare the data, we use the scikit-learn library for data handling and Matplotlib for visualization.
The code below generates synthetic data for two distinct groups with different characteristics. The data is then divided into training and testing sets to evaluate model performance. To ensure accurate modeling, feature standardization is applied, which adjusts the data to have a mean of 0 and a standard deviation of 1. The code concludes with a scatter plot, using 'x' markers for training data and 'o' markers for testing data.
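The generation code itself is not reproduced here; the sketch below is one way to implement the steps described, with the cluster locations, sizes, and variable names (`X_train_std`, `X_test_std`) chosen for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two Gaussian groups with different centers and spreads
rng = np.random.RandomState(42)
X0 = rng.normal(loc=[-2, -2], scale=1.0, size=(100, 2))  # class 0
X1 = rng.normal(loc=[2, 2], scale=1.5, size=(100, 2))    # class 1
X = np.vstack([X0, X1])
y = np.hstack([np.zeros(100), np.ones(100)])

# Divide into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize features to mean 0 and standard deviation 1
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)  # reuse training statistics on the test set

# Scatter plot: 'x' markers for training data, 'o' markers for testing data
plt.scatter(X_train_std[:, 0], X_train_std[:, 1], c=y_train, marker='x', label='train')
plt.scatter(X_test_std[:, 0], X_test_std[:, 1], c=y_test, marker='o', label='test')
plt.legend()
plt.show()
```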
Running this produces the plot below. Since the points fall into two clearly separated groups, the data is a natural fit for binary classification.
Hence, we run the code below.
In this code, we employ logistic regression, a machine learning technique for binary classification. We train the model on the standardized training data (X_train_std) and their labels (y_train). Afterward, we use the trained model to predict the classes of test data (X_test_std) and measure the accuracy of these predictions.
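A minimal sketch of that logistic regression step, assuming data prepared as described above (regenerated here so the snippet is self-contained):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Recreate a small two-class dataset (stand-in for the data prepared earlier)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.hstack([np.zeros(100), np.ones(100)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

# Train logistic regression on the standardized training data and labels
clf = LogisticRegression()
clf.fit(X_train_std, y_train)

# Predict the classes of the test data and measure accuracy
y_pred = clf.predict(X_test_std)
print('Accuracy:', accuracy_score(y_test, y_pred))
```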
● Support Vector Machine
First, import the required libraries for SVM.
The next code loads a dataset that ships with the sklearn library, so we can see what the data looks like.
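One way to do this, assuming the dataset in question is scikit-learn's built-in digits set:

```python
from sklearn import datasets

digits = datasets.load_digits()

# Each row of digits.data is a flattened 8x8 grayscale image: 64 pixel values
print(digits.data.shape)   # (1797, 64)
print(digits.target[:10])  # the digit each of the first ten images represents
print(digits.data[0])      # pixel values of the first image
```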
Then we get the following results. Our dataset is structured as a list of lists: each inner list holds a set of pixel values, and each one constitutes a single image observation. They are handwritten digits!
Since there are ten digits, meaning ten classes in total, this is a multiclass classification problem. We then apply a support vector machine to train and test on our data.
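A minimal sketch of training and testing an SVM on the digits data (the `gamma` value and the split ratio here are illustrative assumptions, not taken from the original code):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Fit a support vector classifier on the training images
clf = svm.SVC(gamma=0.001)  # gamma=0.001 is a commonly used value for this dataset
clf.fit(X_train, y_train)

# Predict the test images and measure accuracy
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
```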
Now let us look at the result, starting with the classification report. For each class we get four values: precision, recall, f1-score, and support. These are, respectively:
- precision: the ratio of true positive (TP) predictions to all positive predictions, TP / (TP + FP), where FP denotes false positives;
- recall: the ratio of TP predictions to all actual positive cases, TP / (TP + FN), where FN denotes false negatives;
- f1-score: the harmonic mean of precision and recall;
- support: the number of actual occurrences of the class in the dataset.
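These definitions can be checked by hand with a small worked example; the counts below are made up for illustration:

```python
# Hypothetical counts for one class (not taken from the report above)
tp, fp, fn, support = 69, 3, 2, 71

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3), support)
# -> 0.958 0.972 0.965 71
```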
Here is the confusion matrix. The horizontal axis represents predicted values, and the vertical axis represents actual values. For example, in the first row, the model predicted 0 for the true class 0 sixty-nine times (correct predictions) and predicted 4 for the true class 0 once (a wrong prediction).
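As a quick sanity check on this layout, scikit-learn's `confusion_matrix` follows the same convention (rows are actual classes, columns are predicted classes); here is a tiny made-up example:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: one class-0 sample is wrongly predicted as class 1
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1 0]
#  [0 2 0]
#  [0 0 1]]
```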
● K-Means Clustering
First, we generate a dataset as follows.
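The original generation code is not shown; a sketch of one possible setup, assuming `make_blobs` with four clusters:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Unlabeled points in a few well-separated clusters (cluster count and
# spread are illustrative choices)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

plt.scatter(X[:, 0], X[:, 1], s=15)
plt.show()
```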
The next code performs the initialization step of the k-means algorithm: it selects initial cluster centers with probabilities derived from the distances of data points to the centers already chosen. This seeding step is crucial for k-means to start clustering efficiently. The result is visualized with a scatter plot showing the data points and the initial cluster centers. The subsequent steps of the algorithm would then iteratively assign each data point to the nearest cluster and update the cluster centers until convergence.
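The distance-proportional seeding described above is the k-means++ strategy; a self-contained sketch (plotting omitted, with the dataset regenerated here) might look like this:

```python
import numpy as np
from sklearn.datasets import make_blobs

def init_centers(X, k, rng):
    """k-means++-style seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest existing center."""
    centers = [X[rng.randint(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()           # distance-proportional probabilities
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
rng = np.random.RandomState(0)
centers = init_centers(X, 4, rng)
print(centers.shape)  # (4, 2)
```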
Please feel free to contact us if you have any comments.
Stay tuned for the next part, coming next month!