SVM MODEL

Ex No                                                 Build SVM Model


Aim

To build an SVM (Support Vector Machine) model.

Algorithm

1.   Import necessary packages and libraries

2.   Load the dataset

3.   Load the Support Vector Machine algorithm and train it on the dataset

4.   Predict the category of new data

Let's go through the Python code step by step to understand what each part is doing.

Step-by-Step Explanation:

  1. Importing Required Libraries:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, accuracy_score
    from matplotlib.colors import ListedColormap
    
    • numpy: Used for numerical operations, such as creating arrays and performing mathematical operations.
    • matplotlib.pyplot: A plotting library used to visualize the results (in this case, decision boundaries and test set points).
    • pandas: Used to load and manipulate the dataset.
    • train_test_split: A utility from sklearn used to split the dataset into training and testing sets.
    • StandardScaler: Standardizes the features, ensuring they all have similar scales (important for SVMs).
• SVC: the Support Vector Classifier used to train the model, here with a radial basis function (RBF) kernel.
    • confusion_matrix and accuracy_score: Used to evaluate the performance of the model.
    • ListedColormap: Used to define custom color schemes for plotting.
  2. Loading the Dataset:

    dataset = pd.read_csv('/content/gdrive/MyDrive/Dataset/Social_Network_Ads.csv')
    
    • This reads the CSV file into a pandas DataFrame. The dataset is assumed to be in your Google Drive and contains the necessary columns (Age, EstimatedSalary, Purchased).
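    • If you want to confirm the file loaded correctly, a quick inspection helps; the shape shown below assumes the commonly distributed 400-row version of this dataset.

    print(dataset.shape)                         # e.g. (400, 5)
    print(dataset.head())                        # first five rows
    print(dataset['Purchased'].value_counts())   # class balance (0 vs 1)
    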
  3. Selecting Features and Target Variable:

    X = dataset.iloc[:, [2, 3]].values  # Features: Age and Estimated Salary
    y = dataset.iloc[:, 4].values       # Target: Purchased (0 or 1)
    
    • X contains the features (Age and EstimatedSalary), found at column indices 2 and 3 of the dataset.
    • y is the target variable (Purchased), at column index 4 (binary classification: 0 = did not purchase, 1 = purchased).
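    • Selecting by position works, but it breaks silently if the column order ever changes. An equivalent selection by name (assuming the columns are actually named Age, EstimatedSalary, and Purchased) is more robust:

    X = dataset[['Age', 'EstimatedSalary']].values
    y = dataset['Purchased'].values
    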
  4. Splitting the Dataset into Training and Testing Sets:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=0)
    
    • The dataset is split into training and testing sets (25% of the data is used for testing).
    • random_state=0 ensures that the split is reproducible.
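    • As an optional refinement (not in the original code), passing stratify=y keeps the 0/1 class proportions the same in both splits, which matters when the classes are imbalanced:

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    print(X_train.shape, X_test.shape)   # e.g. (300, 2) and (100, 2)
    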
  5. Feature Scaling (Standardization):

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    
    • StandardScaler() is used to standardize the features so that they have a mean of 0 and a standard deviation of 1.
    • This is important for SVM, as it is sensitive to the scale of the features. The fit_transform() method standardizes the training data, and transform() is applied to the test data (using the same scaling parameters).
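    • A quick check (not in the original listing) confirms what the scaler did: the training features should now have roughly zero mean and unit standard deviation.

    print(X_train.mean(axis=0))   # approximately [0. 0.]
    print(X_train.std(axis=0))    # approximately [1. 1.]
    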
  6. Training the SVM Model:

    classifier = SVC(kernel='rbf', random_state=0)
    classifier.fit(X_train, y_train)
    
    • We create an SVM classifier with a Radial Basis Function (RBF) kernel. The kernel='rbf' option tells the SVM to use the RBF kernel.
    • The model is then trained using the training data (X_train, y_train).
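    • The code accepts the default C and gamma values. If you want to tune them, a grid search is the usual approach; the parameter grid below is an illustrative assumption, not part of the original experiment.

    from sklearn.model_selection import GridSearchCV

    # Illustrative hyperparameter search (grid values chosen as examples).
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 'scale']}
    search = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)
    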
  7. Making Predictions on the Test Set:

    y_pred = classifier.predict(X_test)
    
    • The trained model is used to make predictions on the test data (X_test), and the predicted values are stored in y_pred.
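    • Step 4 of the algorithm calls for predicting the category of new data. The sketch below (the age and salary values are made up) shows that a new point must pass through the same fitted scaler before prediction:

    # Classify a hypothetical new customer: age 30, salary 87,000.
    new_point = sc.transform([[30, 87000]])
    print(classifier.predict(new_point))   # prints [0] or [1]
    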
  8. Evaluating the Model:

    cm = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    print("Confusion Matrix:")
    print(cm)
    print("Accuracy Score:", accuracy)
    
    • A confusion matrix is calculated using confusion_matrix(), which shows the number of true positives, true negatives, false positives, and false negatives.
    • The accuracy score is also computed using accuracy_score(), which measures the proportion of correctly classified instances.
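    • Accuracy alone can hide class-specific weaknesses. As an optional extra, scikit-learn's classification_report gives per-class precision, recall, and F1:

    from sklearn.metrics import classification_report

    print(classification_report(y_test, y_pred))
    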
  9. Visualizing the Decision Boundary:

    X_set, y_set = X_test, y_test
    X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                                   stop=X_set[:, 0].max() + 1, step=0.01),
                         np.arange(start=X_set[:, 1].min() - 1,
                                   stop=X_set[:, 1].max() + 1, step=0.01))
    
    • The code creates a grid of values that covers the feature space for X_test (i.e., Age and EstimatedSalary), using np.meshgrid().
    • X1 and X2 represent the 2D grid of coordinates that will be used to visualize the decision boundary.

    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha=0.75, cmap=ListedColormap(('red', 'green')))
    
    • The plt.contourf() function is used to plot the decision boundary. The SVM model predicts the class for every point in the grid, and the regions are color-coded (red or green) accordingly.
    • alpha=0.75 makes the contour plot semi-transparent, so the actual data points can be overlaid.

    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    
    • These lines ensure that the plot covers the entire range of the test set features.

    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c=ListedColormap(('red', 'green'))(i), label=j)
    
    • This loop plots the actual test set points (X_test), with a different color for each class (0 or 1).
    • np.unique(y_set) ensures that both classes (0 and 1) are displayed in the plot.

    plt.title('SVM (Test set)')
    plt.xlabel('Age')
    plt.ylabel('Estimated Salary')
    plt.legend()
    plt.show()
    
    • These lines add a title, labels, and a legend to the plot, and finally plt.show() displays it. Note that because the features were standardized in step 5, the axis values are in scaled units rather than raw ages and salaries.
    
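    • The same plotting block is usually repeated for the training set. A small helper function (my refactoring of the code above, not part of the original listing) avoids duplicating it:

    def plot_decision_boundary(clf, X_set, y_set, title):
        # Refactored sketch of the plotting code above.
        X1, X2 = np.meshgrid(
            np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
            np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
        plt.contourf(X1, X2,
                     clf.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                     alpha=0.75, cmap=ListedColormap(('red', 'green')))
        for i, j in enumerate(np.unique(y_set)):
            plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                        c=ListedColormap(('red', 'green'))(i), label=j)
        plt.title(title)
        plt.xlabel('Age (scaled)')
        plt.ylabel('Estimated Salary (scaled)')
        plt.legend()
        plt.show()

    plot_decision_boundary(classifier, X_train, y_train, 'SVM (Training set)')
    plot_decision_boundary(classifier, X_test, y_test, 'SVM (Test set)')
    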

Key Concepts:

  • SVM (Support Vector Machine): A supervised learning algorithm that finds the optimal hyperplane separating classes in high-dimensional space. In this case, an RBF kernel handles the non-linearly separable data; a quick kernel comparison follows this list.
  • Feature Scaling: Important for SVMs to perform well, as they are sensitive to the scale of the features.
  • Confusion Matrix: A useful tool for evaluating classification performance, showing the number of correct and incorrect predictions.
  • Decision Boundary: Visualizing the decision boundary helps understand how the model is making decisions based on the input features.
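
To see why the kernel choice matters, the comparison sketched below (illustrative, not part of the lab) trains a linear-kernel and an RBF-kernel SVM on the same split and prints their test accuracies:

    # Compare a linear-kernel SVM with the RBF-kernel one (illustrative).
    for kernel in ('linear', 'rbf'):
        clf = SVC(kernel=kernel, random_state=0)
        clf.fit(X_train, y_train)
        print(kernel, accuracy_score(y_test, clf.predict(X_test)))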

What You'll See:

  • Confusion Matrix: A matrix that shows the true positives, true negatives, false positives, and false negatives.
  • Accuracy Score: The overall percentage of correctly classified instances in the test set.
  • Decision Boundary Plot: A 2D plot where the regions are colored according to the predicted class (red or green), and the actual data points are overlaid.

This should give you a good understanding of how the SVM model works for this particular dataset. Let me know if you have any questions!
