Just for learning, this time we will try to build a solution to the problem in this Kaggle challenge (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/).

In this challenge we are given training data that contains 8 columns. The table looks like the one below.
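Here is an illustration of the layout (the ids, comments, and label values in this sample are made up, not taken from the real file):

    id      comment_text                toxic   severe_toxic   obscene   threat   insult   identity_hate
    0001    "thanks for the article"    0       0              0         0        0        0
    0002    "you are a stupid idiot"    1       0              1         0        1        0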

The id column is the id of every comment in the comment_text column. comment_text is a string. The remaining six columns hold binary labels that indicate whether the string in comment_text contains a certain kind of “toxic” content: 1 if it does, 0 if it does not.

Our test data contains only id and comment_text. From that test data, we have to predict the probability that each comment_text belongs to every class column. So basically, this work is just predicting the probability of a sentence for 6 different classes. Let's get started.

  1. We are going to import all of the packages we need
    import pandas as pd
    import numpy as np
    
    from keras.preprocessing.text import Tokenizer
    from keras.models import Sequential
    from keras.layers import Dense, LSTM, Flatten
  2. Open the training and testing files using pandas
    train = pd.read_csv(r'C:\Users\Muhammad Fhadli\Documents\Spyder\Jigsaw\Data\train.csv')
    test = pd.read_csv(r'C:\Users\Muhammad Fhadli\Documents\Spyder\Jigsaw\Data\test.csv')
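    A quick sanity check (optional; the row count in the comment assumes the original Jigsaw files):
    print(train.shape)  # (159571, 8): id, comment_text, and the 6 label columns
    print(test.shape)   # the test file has only the id and comment_text columns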
  3. To make things easier to work with, we create a separate variable COMMENT for the comment column (X) and a list label for the label names (Y)
    label = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
    COMMENT = 'comment_text'
  4. Add this function so we can fill empty values with “unknown” and turn the data into a list
    def get_list(data):
        data = data.fillna("unknown")  # replace missing comments with the placeholder "unknown"
        return data.tolist()
  5. Add this function to tokenize our data, turn it into a matrix, and reshape it into (data size, number of time steps, features per time step)
    def tokenize_reshape(data):
        data = tok.texts_to_matrix(data)  # (num_samples, num_words) bag-of-words matrix
        return np.reshape(data, (data.shape[0], 1, data.shape[1]))  # add a single time-step axis
  6. Apply the get_list function to the training and testing data
    x_train = get_list(train[COMMENT])
    x_test = get_list(test[COMMENT])
  7. Create an object from the Tokenizer class. We use 1000 words for our vocabulary in this experiment. After we create the object, we fit it on the training data, because we want the training data to be the source of the vocabulary.
    tok = Tokenizer(num_words=1000)
    tok.fit_on_texts(x_train)
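    As a quick illustration (this snippet is only for inspection and is not part of the pipeline), texts_to_matrix turns each text into a fixed-length binary vector of size num_words:
    sample = tok.texts_to_matrix(["this is a sample comment"])
    print(sample.shape)  # (1, 1000): one row per text, one column per vocabulary word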
  8. Apply the tokenize_reshape function to the training and testing data.
    x_train = tokenize_reshape(x_train)
    x_test = tokenize_reshape(x_test)
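    If everything went well, every comment is now a single time step of 1000 features:
    print(x_train.shape)  # (159571, 1, 1000)
    print(x_test.shape)   # (number of test comments, 1, 1000)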
  9. Create the model. The first layer is an LSTM with 200 units; 200 is the size of the hidden state that the LSTM handles internally, so you don't have to worry about it. The shape of our input data is (159571, 1, 1000): 159571 is the number of instances in our training data, 1 is the number of time steps (we treat each comment as a single time step here), and 1000 is the number of features per time step, coming from the vocabulary we fitted in step 7. We set return_sequences=True so the LSTM outputs one vector per time step and its output keeps the same number of dimensions as its input
    model = Sequential()
    model.add(LSTM(200, input_shape=(1, x_train.shape[2]), return_sequences=True))
    model.add(LSTM(200, return_sequences=True))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  10. In the second layer, we add a similar LSTM so our model can understand the data better. You can try as many LSTM layers as you want, just for learning.
  11. The output of the LSTM layer in this case is a 3-dimensional tensor. To turn it into a 2-dimensional tensor, we add Flatten()
  12. The final output we want is a single value between 0 and 1. So, we put a Dense layer with one unit at the end and use sigmoid as our activation function.
  13. Here is the summary of our model, which you can print with model.summary():
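    model.summary()
    # prints each layer's output shape and parameter count; with the layers
    # above, the total should come to 1,281,801 trainable parameters:
    # 4*(1000+200+1)*200 for the first LSTM, 4*(200+200+1)*200 for the
    # second LSTM, and 200+1 for the Dense layer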
  14. Create a 2-dimensional array to store our output
    result = np.zeros((len(x_test),len(label)))
  15. In every iteration, we take the values of one class, fit the model on those values, predict probabilities for the test set, and store the probabilities in the corresponding column of result
    for i, j in enumerate(label):
        print("Train ",j)
        y_train = train[j].values
        model.fit(x_train, y_train, epochs=10)
        result[:,i] = model.predict_proba(x_test)[:,0]
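    Note that the same model object is fitted in every iteration, so the weights learned for one class carry over as the starting point for the next. Also, predict_proba was removed from Sequential in recent versions of Keras; if your version does not have it, model.predict(x_test)[:, 0] returns the same sigmoid probabilities.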
  16. Finally, we create a submission DataFrame containing the id of every test instance, convert result into a DataFrame, add the result columns to the submission, and save the submission as a csv file.
    submission = pd.DataFrame({'id': test['id']})         # ids of every test instance
    result = pd.DataFrame(result, columns=label)          # convert result into a DataFrame
    submission = pd.concat([submission, result], axis=1)  # add the result columns
    submission.to_csv('submission.csv', index=False)      # save as a csv file
  17. Our final result will look like the picture below

You can see the full code at this link (https://github.com/muhammadfhadli1453/Toxic-Comment-Classification-Using-LSTM).

If you have any questions or suggestions, please feel free to leave a comment. That is all for today. Thank you.
