Machine Learning in One Hour: Predicting Drug Addiction Risk

Last week I stumbled on a machine learning hackathon quite by chance, and I hadn't written anything for many days. So I feel like I have both the substance and the energy to write something up today.

The problem is quite simple but rather poorly designed (the HackerEarth challenge simply sliced data from a past Kaggle dataset and started the hackathon). Still, I approached the problem in good faith and worked pretty much from scratch.

For busy people, the kernel is here on Kaggle Scripts.

Input Features:

Year, location, the question that was asked in the survey, the answer, race, grade, gender, and 3 strategies applied to them to reduce risk. The data looks something like the sample below:

[Image: sample rows of the input data]

Output:

The percentage of risk (Greater_Risk_Probability column)

Solution Approach:
  1. Keep the numerical variables as-is.
  2. Next, convert StratificationTyp, sex, and QuestionCode to label-encoded integers.
  3. Create some extra features:
    • asked_about_marijuana (from Greater_Risk_Question)
    • asked_about_alcohol (from Greater_Risk_Question)
    • asked_about_heroin (from Greater_Risk_Question)
    • injected_something (from Greater_Risk_Question)
    • inhaled_something (from Greater_Risk_Question)
    • asked_about_cocaine (from Greater_Risk_Question)
    • is_hispanic (from Race)
    • is_asian (from Race)
    • is_native (from Race)
    • is_black (from Race)
    • is_white (from Race)
  4. Drop everything else
  5. Finally, fit the whole data to a tuned LGBM model (BAM! We got an RMSE below 0.05).

[Figure: output from the LGBM model]
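
To make steps 2 and 5 a bit more concrete, here is a minimal sketch in plain Python of what label encoding and the RMSE metric do. This is not the code from the kernel: the helper names `label_encode` and `rmse` and the sample values are made up for illustration, and the sorted-order mapping is just one reasonable choice (similar in spirit to scikit-learn's `LabelEncoder`).

```python
import math


def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order.

    Returns the encoded list and the value -> code mapping.
    """
    codes = {value: index for index, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up sample values, not taken from the hackathon data.
sex = ["Female", "Male", "Total", "Male"]
encoded, mapping = label_encode(sex)
print(encoded)   # integer codes, one per row
print(mapping)   # which category got which code

# Made-up targets and predictions to show how the metric is computed.
print(rmse([0.2, 0.4], [0.25, 0.35]))
```

The integer codes carry no real ordering, which tree-based models like LGBM generally tolerate well; for a linear model one-hot encoding would probably be the safer choice.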

 

Notes:

  • The idea behind features like `asked_about_marijuana` is simple. It’s hard to process the Greater_Risk_Question column sensibly in its text form, but logically the features derived from Greater_Risk_Question might just capture the essence of the variable. And of course those extra features are easy to create.
    A few lines of simple code generate all the extra features I’m going to use in the model.
  • In the same way, we can generate all the features from the Race column.
  • This is an absolutely basic model. There is a lot of scope for further improvement, e.g. zero-centering the year column might be useful.
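
The notes above can be sketched roughly like this. It is a minimal illustration in plain Python, not the kernel's actual code: the keyword lists, the helper names `keyword_flags` and `center_values`, and the sample strings are all assumptions, and the real question and race texts in the dataset may need different keywords. With pandas, the same idea would be one `str.contains` call per keyword.

```python
# Hypothetical keyword lists: each feature is 1 if any keyword appears in the text.
QUESTION_KEYWORDS = {
    "asked_about_marijuana": ("marijuana",),
    "asked_about_alcohol": ("alcohol",),
    "asked_about_heroin": ("heroin",),
    "asked_about_cocaine": ("cocaine",),
    "injected_something": ("inject",),
    "inhaled_something": ("inhal", "sniff"),
}

RACE_KEYWORDS = {
    "is_hispanic": ("hispanic",),
    "is_asian": ("asian",),
    "is_native": ("native",),
    "is_black": ("black",),
    "is_white": ("white",),
}


def keyword_flags(text, keyword_map):
    """Return {feature_name: 0 or 1} using case-insensitive substring matching."""
    lowered = text.lower()
    return {
        name: int(any(keyword in lowered for keyword in keywords))
        for name, keywords in keyword_map.items()
    }


def center_values(values):
    """Zero-center a numeric column by subtracting its mean."""
    mean = sum(values) / len(values)
    return [value - mean for value in values]


# Made-up sample strings, only to show the shape of the output.
question_flags = keyword_flags("Ever injected any illegal drug", QUESTION_KEYWORDS)
race_flags = keyword_flags("Black or African American", RACE_KEYWORDS)
centered_years = center_values([2013, 2015, 2017])

print(question_flags)
print(race_flags)
print(centered_years)
```

Substring matching is crude (a keyword like "native" would match any race label containing that word), so it is worth eyeballing the distinct values of each column before trusting the flags.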
