This BCGX DS Intern OA is noticeably closer to a real data science workflow: metric analysis → dataset construction → feature engineering → modeling and prediction, which is essentially a complete DS pipeline. If you usually practice only LeetCode-style problems, this type of question may feel unfamiliar.
Q1: Calculating core business metrics
Three core metrics are computed from the ride-hailing platform data: the average driver rating, the share of drivers who speak a second language, and the order success rate. After combining the relevant data sources, take the mean of the numeric rating column, the proportion of True values in the Boolean language-skill column, and the share of orders whose status counts as successful. Return the results as a tidy (metric, value) table so the output is accurate and easy to read.
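To make the three calculations concrete, here is a minimal pandas sketch. The table layout, the column names (`rating`, `speaks_second_language`, `status`) and the `"completed"` status label are all assumptions for illustration; the actual OA schema may differ.

```python
import pandas as pd

# Hypothetical toy tables; the real column names are not given in this post.
drivers = pd.DataFrame({
    "driver_id": [1, 2, 3, 4],
    "rating": [4.8, 4.5, 4.9, 4.2],
    "speaks_second_language": [True, False, True, False],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13, 14],
    "driver_id": [1, 1, 2, 3, 4],
    "status": ["completed", "cancelled", "completed", "completed", "completed"],
})

def core_metrics(drivers, orders, success_label="completed"):
    """Return a two-column (metric, value) table with the three metrics."""
    rows = [
        # Mean of a numeric column.
        ("avg_driver_rating", drivers["rating"].mean()),
        # Mean of a Boolean column equals the share of True values.
        ("second_language_share", drivers["speaks_second_language"].mean()),
        # Share of orders whose status equals the assumed success label.
        ("order_success_rate", (orders["status"] == success_label).mean()),
    ]
    return pd.DataFrame(rows, columns=["metric", "value"]).round({"value": 4})

metrics = core_metrics(drivers, orders)
print(metrics)
```

Taking the mean of a Boolean mask is a compact way to get a proportion, and building the result from a list of tuples keeps the output in the required two-column shape.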
Q2: Building a driver profile dataset
Integrate driver, vehicle, and trip data into a complete driver profile dataset through data cleaning, feature calculation, and multi-table joins. The three sources are processed separately first: compute each driver's years of driving experience, the vehicle inspection days, and the number of liked trips. Then join the tables on the primary key, handle missing values, and select the final columns. The key steps are date arithmetic, Boolean aggregation, multi-table merging, and data cleaning.
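The steps above can be sketched roughly as follows. Everything specific here is an assumption: the column names, the `AS_OF` reference date, reading "inspection days" as days since the last inspection, and filling missing like counts with 0. Treat it as one plausible shape for the solution, not the expected answer.

```python
import pandas as pd

AS_OF = pd.Timestamp("2024-01-01")  # hypothetical reference date for date math

# Hypothetical toy tables keyed by driver_id.
drivers = pd.DataFrame({
    "driver_id": [1, 2, 3],
    "license_date": ["2015-06-01", "2020-01-01", None],
})
vehicles = pd.DataFrame({
    "driver_id": [1, 2],
    "last_inspection": ["2023-12-01", "2023-01-01"],
})
trips = pd.DataFrame({
    "driver_id": [1, 1, 2, 2, 2],
    "liked": [True, False, True, True, False],
})

def build_driver_profile(drivers, vehicles, trips, as_of=AS_OF):
    # Date arithmetic: whole years since the license date (365-day approximation).
    d = drivers.copy()
    d["license_date"] = pd.to_datetime(d["license_date"])
    d["years_driving"] = (as_of - d["license_date"]).dt.days // 365

    # Date arithmetic: days elapsed since the last vehicle inspection.
    v = vehicles.copy()
    v["last_inspection"] = pd.to_datetime(v["last_inspection"])
    v["days_since_inspection"] = (as_of - v["last_inspection"]).dt.days

    # Boolean aggregation: summing True/False counts liked trips per driver.
    likes = (
        trips.groupby("driver_id")["liked"].sum().rename("trip_likes").reset_index()
    )

    # Left joins on the primary key keep every driver, even without a match.
    profile = d.merge(v, on="driver_id", how="left").merge(
        likes, on="driver_id", how="left"
    )

    # Assumed missing-value rule: a driver with no trips has 0 likes.
    profile["trip_likes"] = profile["trip_likes"].fillna(0).astype(int)

    # Keep only the final fields.
    return profile[
        ["driver_id", "years_driving", "days_since_inspection", "trip_likes"]
    ]

profile = build_driver_profile(drivers, vehicles, trips)
print(profile)
```

Using left joins from the driver table keeps one row per driver, which makes it easy to see which features are missing before deciding how to fill or drop them.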
Q3: Data preprocessing before machine learning
Preprocess the ride-hailing driver data for a machine learning model: missing-value imputation, categorical encoding, numeric standardization, and label conversion. Numeric and categorical features are handled separately: missing ages are filled with the training-set mean, categorical variables are mapped to ordinal codes, tip amounts are standardized, and driver grades are converted to binary labels. The key points are preventing data leakage (every statistic is computed on the training set only), applying identical transformations to the training and test sets, and controlling numeric precision.
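A leakage-safe version of this preprocessing might look like the sketch below. The column names, the choice of grade `"A"` as the positive class, the alphabetical ordinal mapping, and rounding to 4 decimals are assumptions made for illustration. The point it shows is the fit-on-train, apply-to-both pattern.

```python
import pandas as pd

# Hypothetical toy data; column names and grade values are assumptions.
train = pd.DataFrame({
    "age": [30.0, None, 50.0, 40.0],
    "car_type": ["sedan", "suv", "sedan", "van"],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "grade": ["A", "B", "A", "C"],
})
test = pd.DataFrame({
    "age": [None, 25.0],
    "car_type": ["suv", "van"],
    "tip": [2.5, 5.0],
    "grade": ["B", "A"],
})

def fit_preprocessor(train_df):
    """Learn every statistic from the training set only (no leakage)."""
    categories = sorted(train_df["car_type"].dropna().unique())
    return {
        "age_mean": train_df["age"].mean(),
        "car_type_map": {c: i for i, c in enumerate(categories)},
        "tip_mean": train_df["tip"].mean(),
        "tip_std": train_df["tip"].std(ddof=0),  # population std
    }

def transform(df, params, positive_grade="A", decimals=4):
    """Apply the same learned parameters to any split."""
    out = pd.DataFrame(index=df.index)
    out["age"] = df["age"].fillna(params["age_mean"])
    # Categories unseen in training become NaN and need a separate policy.
    out["car_type"] = df["car_type"].map(params["car_type_map"])
    z = (df["tip"] - params["tip_mean"]) / params["tip_std"]
    out["tip"] = z.round(decimals)
    out["label"] = (df["grade"] == positive_grade).astype(int)
    return out

params = fit_preprocessor(train)
X_train = transform(train, params)
X_test = transform(test, params)
print(X_test)
```

Splitting the work into a fit step and a transform step means the test set can never influence the mean, the standard deviation, or the category mapping.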
Q4: Random forest modeling and prediction
Using the processed driver features, train a random forest model to predict the driver class for each row in the test set. The training and validation sets are merged so the model uses all available labeled data, and a random forest classifier is fit on the result. The key points are balancing class weights to improve recall, fixing the random seed for reproducibility, and keeping the predictions aligned with the original index. The output is the binary classification prediction for each test row.
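As a rough illustration of those key points, the sketch below uses scikit-learn's `RandomForestClassifier` on synthetic data. The features, the label rule, the index values, and the hyperparameters (`n_estimators=100`, `random_state=42`) are all made up for the example; only the overall pattern follows the description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_split(n, start):
    """Build a synthetic feature table and binary label (illustrative only)."""
    X = pd.DataFrame(
        {"age": rng.normal(40, 10, n), "tip": rng.normal(0, 1, n)},
        index=pd.RangeIndex(start, start + n, name="driver_id"),
    )
    y = (X["tip"] > 0.5).astype(int).rename("label")
    return X, y

X_train, y_train = make_split(80, 0)
X_val, y_val = make_split(20, 1000)
X_test, _ = make_split(10, 2000)

# Merge train and validation so the final model sees all labeled rows.
X_full = pd.concat([X_train, X_val])
y_full = pd.concat([y_train, y_val])

model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=42,          # fixed seed for reproducible results
)
model.fit(X_full, y_full)

# Wrap predictions in a Series that reuses the test index, so each
# prediction stays attached to its original row identifier.
pred = pd.Series(model.predict(X_test), index=X_test.index, name="prediction")
print(pred)
```

Returning a Series indexed like `X_test`, rather than a bare NumPy array, avoids silent misalignment if the test rows are later sorted or joined back to other tables.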
Final thoughts
If you are also preparing for the BCGX OA or for another North American DS position's OA and need references for specific questions or approaches, you can contact Programhelp. I have taken similar OAs several times myself and have helped many students pass. If you are unsure, professional real-time assistance is available to point you in the right direction at key steps, which saves time otherwise lost to trial and error and makes preparation more efficient and confident.