Intro To Machine Learning w/ Python for Beginners

Machine learning is a subset of artificial intelligence in which algorithms learn patterns from data and make predictions or decisions based on those patterns. Python is a popular programming language for machine learning thanks to its simplicity, readability, and the many libraries available for it. In this blog post, we will walk through an introduction to machine learning using sklearn in Python 3. You can either open up your favorite code editor like Visual Studio Code, or use Google’s online notebook environment, Colab: https://colab.research.google.com/.

What is sklearn?

scikit-learn, also known as sklearn, is an open-source machine learning library for Python that provides efficient tools for data mining and data analysis. It is built on top of the NumPy, SciPy, and Matplotlib libraries. It features a variety of classification, regression, and clustering algorithms, and it also supports preprocessing, model selection, and dimensionality reduction techniques.

Installation

Before we begin, we need to make sure that sklearn is installed. If you are using Colab, it already comes preinstalled. If you are working locally, you can install it with pip using the following command:

pip install scikit-learn
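
To verify the installation, you can import the library and print its version:

import sklearn
print(sklearn.__version__)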

Once installed, we can begin our introduction to machine learning using sklearn in Python 3.

Loading Data

The first step in any machine learning project is to load the data. sklearn ships with several small datasets for practice, and we can also load our own data. In this example, we will use the famous iris dataset, which can be loaded with the following code:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Here, X is the feature matrix that contains the input variables, and y is the target vector that contains the output variable.
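
It can be helpful to take a quick look at what we just loaded. The dataset object also exposes the feature and class names:

print(X.shape)             # (150, 4): 150 samples, 4 features each
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']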

Preprocessing

After loading the data, we need to preprocess it. Preprocessing involves transforming the data to make it ready for machine learning algorithms. In this example, we will use the StandardScaler to standardize the features so that they are all on the same scale (zero mean and unit variance).

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
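
For simplicity, we scale the full dataset here, before splitting it in the next step. In a real project you would usually fit the scaler on the training set only and reuse it on the test set, so that no information from the test data leaks into preprocessing. A minimal sketch of that pattern, applied after the split:

X_train = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test = scaler.transform(X_test)        # apply the same transformation to test data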

Splitting the Data

Before we can train our model, we need to split the data into training and testing sets. This is done so we can evaluate the model on data it has not seen during training. We will use the train_test_split function for this.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, test_size=0.2 reserves 20% of the data for testing and leaves the remaining 80% for training, while random_state=42 makes the split reproducible.
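
For classification problems, it can also be worth passing stratify=y, which keeps the class proportions the same in both splits:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)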

Training the Model

We can now train our machine learning model. In this example, we will use the k-nearest neighbors (KNN) algorithm to predict the species of iris based on its features.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

Here, we create an instance of the KNeighborsClassifier class with n_neighbors=3, which means that each prediction is decided by a majority vote among the 3 nearest training points.
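
The choice of n_neighbors is a hyperparameter, and 3 is simply a reasonable starting point for this small dataset. One common way to choose it, sketched here, is to compare a few candidate values with cross-validation on the training data:

from sklearn.model_selection import cross_val_score

# Report the mean 5-fold cross-validation accuracy for each candidate k.
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, scores.mean())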

Making Predictions

We can now use our trained model to make predictions on the testing data.

y_pred = knn.predict(X_test)
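
We can also classify a single new flower. Any new measurements must go through the same scaler as the training data before being passed to the model; the values below are made-up measurements for illustration:

import numpy as np

# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
new_flower = scaler.transform(new_flower)  # reuse the fitted scaler
print(iris.target_names[knn.predict(new_flower)])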

Evaluating the Model

Finally, we need to evaluate the performance of our model. We will use accuracy, the fraction of predictions that are correct (printed below as a percentage).

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Conclusion

In conclusion, we built a k-nearest neighbors model on the example iris dataset and used it to classify flowers by species. Using the predict method, we can make predictions on new data. Evaluating the model is just as important, since it tells you how well your model classifies data it has never seen.