Ex No Build SVM module
Aim
To build an SVM (Support Vector Machine) module.
Algorithm
1. Import the necessary packages and libraries.
2. Load the dataset.
3. Load the Support Vector Machine algorithm and train it using the dataset.
4. Predict the category of new data.
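The four steps above can be sketched end to end as follows. This is only a minimal illustration, assuming scikit-learn is installed: it substitutes a synthetic two-feature dataset from make_classification for the real CSV file, and chains scaling and the classifier in a pipeline for brevity.

```python
# Step 1: import the necessary packages.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step 2 (stand-in): synthetic data instead of the real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 3: scale the features and train an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=0))
model.fit(X_train, y_train)

# Step 4: predict the category of data the model has not seen.
y_pred = model.predict(X_test)
print(y_pred.shape)
```

The pipeline is a convenience, not a requirement; the walkthrough below performs the same scaling and training as separate, explicit steps.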
Let's go through the Python code step by step to understand what each part is doing.
Step-by-Step Explanation:
- Importing Required Libraries:
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, accuracy_score
    from matplotlib.colors import ListedColormap
  - numpy: Used for numerical operations, such as creating arrays and performing mathematical operations.
  - matplotlib.pyplot: A plotting library used to visualize the results (in this case, decision boundaries and test set points).
  - pandas: Used to load and manipulate the dataset.
  - train_test_split: A utility from sklearn used to split the dataset into training and testing sets.
  - StandardScaler: Standardizes the features, ensuring they all have similar scales (important for SVMs).
  - SVC: Support Vector Classifier used to train the model (here with a radial basis function (RBF) kernel).
  - confusion_matrix and accuracy_score: Used to evaluate the performance of the model.
  - ListedColormap: Used to define custom color schemes for plotting.
- Loading the Dataset:
    dataset = pd.read_csv('/content/gdrive/MyDrive/Dataset/Social_Network_Ads.csv')
  - This reads the CSV file into a pandas DataFrame. The dataset is assumed to be in your Google Drive and to contain the necessary columns (Age, EstimatedSalary, Purchased).
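Since the rest of the code depends on those columns being present, a quick check right after loading can save confusion later. The sketch below is an illustration only: it reads a few made-up rows from an in-memory string instead of the real file, and the `required` list is simply the three column names mentioned above.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real file is not reproduced here.
csv_text = """Age,EstimatedSalary,Purchased
25,30000,0
40,60000,1
52,90000,1
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# Fail early if any expected column is absent.
required = ['Age', 'EstimatedSalary', 'Purchased']
missing = [col for col in required if col not in dataset.columns]
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

print(dataset.shape)
print(dataset.dtypes)
```

With the real file, the io.StringIO(...) argument would be replaced by the CSV path.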
- Selecting Features and Target Variable:
    X = dataset.iloc[:, [2, 3]].values  # Features: Age and Estimated Salary
    y = dataset.iloc[:, 4].values       # Target: Purchased (0 or 1)
  - X contains the features (Age and EstimatedSalary), which are at zero-based column positions 2 and 3 of the dataset.
  - y is the target variable (Purchased), which is at zero-based column position 4 (binary classification: 0 = did not purchase, 1 = purchased).
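Positional indexing breaks silently if the column order changes, so selecting by name is a common alternative. The sketch below uses a toy DataFrame; the 'UserID' and 'Gender' columns are hypothetical placeholders chosen only so that the features sit at positions 2 and 3, as in the code above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 'UserID' and 'Gender' are hypothetical.
dataset = pd.DataFrame({
    'UserID': [1, 2, 3],
    'Gender': ['F', 'M', 'F'],
    'Age': [25, 40, 52],
    'EstimatedSalary': [30000, 60000, 90000],
    'Purchased': [0, 1, 1],
})

# Selection by position, as in the walkthrough.
X_by_position = dataset.iloc[:, [2, 3]].values

# Equivalent selection by column name.
X_by_name = dataset[['Age', 'EstimatedSalary']].to_numpy()
y = dataset['Purchased'].to_numpy()

print(np.array_equal(X_by_position, X_by_name))
```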
- Splitting the Dataset into Training and Testing Sets:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
  - The dataset is split into training and testing sets (25% of the data is used for testing).
  - random_state=0 ensures that the split is reproducible.
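To see what test_size and random_state do in practice, the following sketch splits a small synthetic array twice with the same seed. The data is made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features each, and alternating 0/1 labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Two splits with the same random_state.
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# Every returned array matches between the two runs.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

With 100 rows and test_size=0.25, 75 rows go to training and 25 to testing.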
- Feature Scaling (Standardization):
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
  - StandardScaler() is used to standardize the features so that they have a mean of 0 and a standard deviation of 1.
  - This is important for SVM, as it is sensitive to the scale of the features. The fit_transform() method standardizes the training data, and transform() is applied to the test data (using the same scaling parameters).
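The point about "the same scaling parameters" can be checked directly. In this sketch the data is randomly generated with made-up age-like and salary-like ranges; the names X_train_scaled and X_test_scaled are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[40, 70000], scale=[10, 30000], size=(75, 2))
X_test = rng.normal(loc=[40, 70000], scale=[10, 30000], size=(25, 2))

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean_ and scale_ here
X_test_scaled = sc.transform(X_test)        # reuses the training statistics

# The training columns now have mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))

# The test set is scaled with the training mean and standard deviation.
manual = (X_test - sc.mean_) / sc.scale_
print(np.allclose(manual, X_test_scaled))
```

Note that the scaled test set will generally not have a mean of exactly 0, because its statistics were never used for fitting.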
- Training the SVM Model:
    classifier = SVC(kernel='rbf', random_state=0)
    classifier.fit(X_train, y_train)
  - We create an SVM classifier with a Radial Basis Function (RBF) kernel. The kernel='rbf' option tells the SVM to use the RBF kernel.
  - The model is then trained using the training data (X_train, y_train).
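For intuition about what the RBF kernel computes, it measures similarity between two points as K(x, x') = exp(-gamma * ||x - x'||^2). The sketch below evaluates that formula by hand and compares it with scikit-learn's rbf_kernel helper. The two points and the gamma value are arbitrary choices for illustration; in recent scikit-learn versions SVC picks gamma automatically by default (gamma='scale').

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary points and an illustrative gamma (assumptions).
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])
gamma = 0.5

# Manual evaluation of exp(-gamma * squared Euclidean distance).
manual = np.exp(-gamma * np.sum((a - b) ** 2))

# The same quantity from scikit-learn.
library = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(manual, library)
```

Here the squared distance is 2, so both values equal exp(-1). Points that are close together get a similarity near 1, and distant points get a similarity near 0.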
- Making Predictions on the Test Set:
    y_pred = classifier.predict(X_test)
  - The trained model is used to make predictions on the test data (X_test), and the predicted values are stored in y_pred.
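The same predict() call works for a single new observation, provided it is scaled with the scaler fitted on the training data. The sketch below is a self-contained illustration on synthetic data; the cluster centers and the new_point values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two synthetic clusters standing in for the two classes (made-up values).
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=[30, 40000], scale=[5, 8000], size=(50, 2))
class_1 = rng.normal(loc=[50, 90000], scale=[5, 8000], size=(50, 2))
X_train = np.vstack([class_0, class_1])
y_train = np.array([0] * 50 + [1] * 50)

# Fit the scaler and the classifier on the training data.
sc = StandardScaler()
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(sc.fit_transform(X_train), y_train)

# A hypothetical new observation: [Age, EstimatedSalary].
new_point = np.array([[48, 95000]])

# Scale with the already-fitted scaler, then predict.
prediction = classifier.predict(sc.transform(new_point))
print(prediction)
```

Passing the raw, unscaled values to predict() would not raise an error, but the model would interpret them on the wrong scale.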
- Evaluating the Model:
    cm = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    print("Confusion Matrix:")
    print(cm)
    print("Accuracy Score:", accuracy)
  - A confusion matrix is calculated using confusion_matrix(), which shows the number of true positives, true negatives, false positives, and false negatives.
  - The accuracy score is also computed using accuracy_score(), which measures the proportion of correctly classified instances.
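To make the link between the two metrics concrete, the sketch below uses six hand-written labels (made up for illustration) and recomputes the accuracy from the confusion matrix entries. For binary labels 0 and 1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration only.
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_test, y_pred)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_test, y_pred)
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(cm)
print(accuracy, manual_accuracy)
```

In this example four of the six predictions are correct (two true negatives and two true positives), so both accuracy values are 4/6.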
- Visualizing the Decision Boundary:
    X_set, y_set = X_test, y_test
    X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                         np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
  - The code creates a grid of values that covers the feature space for X_test (i.e., the standardized Age and EstimatedSalary), using np.meshgrid().
  - X1 and X2 represent the 2D grid of coordinates that will be used to visualize the decision boundary.
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha=0.75, cmap=ListedColormap(('red', 'green')))
  - The plt.contourf() function is used to plot the decision boundary. The SVM model is used to predict the class for every point in the grid, and the decision boundary is drawn by color-coding the regions (red or green).
  - alpha=0.75 makes the contour plot semi-transparent, so the actual data points can be overlaid.
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
  - These lines ensure that the plot covers the entire range of the grid, which extends one unit beyond the test set features on each side.
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c=ListedColormap(('red', 'green'))(i), label=j)
  - This loop plots the actual test set points (X_test), with a different color for each class (0 or 1).
  - np.unique(y_set) ensures that both classes (0 and 1) are displayed in the plot.
    plt.title('SVM (Test set)')
    plt.xlabel('Age')
    plt.ylabel('Estimated Salary')
    plt.legend()
    plt.show()
  - These lines add a title, labels, and a legend to the plot, and finally, plt.show() displays the plot. Because the features were standardized, the axes show scaled values rather than raw ages and salaries.
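The grid-prediction step is the least obvious part of the plot, so here it is in isolation, without any plotting. This is a sketch on randomly generated stand-in data (not the real test set), with a coarser step of 0.1 to keep it fast; grid_points and Z are illustrative names.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, already-scaled stand-in for the test set (assumption).
rng = np.random.default_rng(0)
X_set = rng.normal(size=(60, 2))
y_set = (X_set[:, 0] + X_set[:, 1] > 0).astype(int)
classifier = SVC(kernel='rbf', random_state=0).fit(X_set, y_set)

# Grid covering the data range plus a margin of 1 on each side.
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.1),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.1))

# Flatten the grid into a list of (x1, x2) points, one row per grid cell.
grid_points = np.array([X1.ravel(), X2.ravel()]).T

# Predict a class for every grid point, then restore the 2D grid shape.
Z = classifier.predict(grid_points).reshape(X1.shape)

print(grid_points.shape, Z.shape)
```

Z is the array that plt.contourf(X1, X2, Z, ...) would color: each cell holds the predicted class for that location in feature space.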
Key Concepts:
- SVM (Support Vector Machine): A supervised learning algorithm that finds the maximum-margin hyperplane separating the classes. In this case, we used an RBF kernel so that the model can learn a non-linear decision boundary.
- Feature Scaling: Important for SVMs to perform well, as they are sensitive to the scale of the features.
- Confusion Matrix: A useful tool for evaluating classification performance, showing the number of correct and incorrect predictions.
- Decision Boundary: Visualizing the decision boundary helps understand how the model is making decisions based on the input features.
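The idea of a separating hyperplane is easiest to inspect with a linear kernel, where the fitted model exposes the weights w and intercept b directly. The sketch below swaps in kernel='linear' purely for illustration (the walkthrough itself uses RBF, where coef_ is not available) and runs on synthetic data.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic clusters (made-up data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(30, 2)),
               rng.normal(loc=2, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Linear kernel: the decision boundary is the line w . x + b = 0.
clf = SVC(kernel='linear').fit(X, y)
w = clf.coef_.ravel()
b = clf.intercept_[0]

# The decision function is the signed score w . x + b for each point.
manual_scores = X @ w + b
print(np.allclose(manual_scores, clf.decision_function(X)))

# Only the support vectors determine the position of the boundary.
print(len(clf.support_vectors_))
```

Points with a positive score are assigned to class 1 and points with a negative score to class 0.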
What You'll See:
- Confusion Matrix: A matrix that shows the true positives, true negatives, false positives, and false negatives.
- Accuracy Score: The overall percentage of correctly classified instances in the test set.
- Decision Boundary Plot: A 2D plot where the regions are colored according to the predicted class (red or green), and the actual data points are overlaid.
This should give you a good understanding of how the SVM model works for this particular dataset. Let me know if you have any questions!