Two days ago I walked a student through the BCGX Data Scientist OA. To be honest, this OA starts to feel like déjà vu once you have done it enough times. The question types are basically stable: over the past two years, most sessions have drawn on the same two sets of questions.
We have been through it many times, and the structure of the questions has hardly changed. As long as you have worked out your approach or written a code skeleton beforehand, it goes quickly on the day. This round had 4 questions covering Python data processing plus simple modeling. The overall difficulty is not high; it feels more like a small data pipeline followed by an ML task.
Here is a brief rundown of the core content of the four questions.

Q1 Data Statistics
The first question is a typical data statistics task. Given a driver table and multiple trip data files, you need to calculate several metrics and output them to CSV.
There are three things to do:
- Calculate the average driver rating from the driver table
- Calculate the proportion of drivers who speak a second language
- Combine all trip files and calculate the proportion of successful trips
The only thing to watch for is that the trip data is split across multiple files and has to be concatenated before any statistics are computed. The rest is routine pandas work:
- Read data
- Conditional filtering
- Compute means
- Compute proportions
Finally, organize the three indicators into a small table and export it to CSV.
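As a rough illustration of the steps above, here is a minimal pandas sketch. The column names (`rating`, `second_language`, `status`), the "completed" status value, and the in-memory CSV stand-ins are all assumptions made for the example; the actual OA schema is not reproduced here.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the driver table and the trip files.
drivers_csv = io.StringIO(
    "driver_id,rating,second_language\n"
    "1,4.5,Spanish\n"
    "2,4.0,\n"
    "3,5.0,French\n"
    "4,3.5,\n"
)
trip_files = [
    io.StringIO("trip_id,driver_id,status\n1,1,completed\n2,2,cancelled\n"),
    io.StringIO("trip_id,driver_id,status\n3,3,completed\n4,1,completed\n"),
]

drivers = pd.read_csv(drivers_csv)

# Stack every trip file into one frame before computing any ratio.
trips = pd.concat([pd.read_csv(f) for f in trip_files], ignore_index=True)

summary = pd.DataFrame(
    {
        "metric": ["avg_rating", "second_language_share", "successful_trip_share"],
        "value": [
            drivers["rating"].mean(),
            # Assumption: an empty cell means "no second language".
            drivers["second_language"].notna().mean(),
            # Assumption: a successful trip has status "completed".
            (trips["status"] == "completed").mean(),
        ],
    }
)

# In the OA this would be summary.to_csv(<output path>, index=False).
csv_text = summary.to_csv(index=False)
print(csv_text)
```

With real files, the list of `StringIO` objects would be replaced by the file paths, for example collected with `glob`.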
Q2 Data Preprocessing
The second question looks like the data preprocessing stage of a machine learning project.
The main steps are roughly:
- Fill in missing ages with the mean age of the training set (rounded)
- Combine the training set and test set to facilitate unified processing
- Encode car model and second language as integers, following the order of the categories in the training set
- Standardize the tip amount and round it to five decimal places
- Map driver level A / B to 0 / 1
After processing, save the processed training set and test set separately.
This question is essentially a simplified feature engineering + preprocessing pipeline. If you use pandas or sklearn regularly, it is very smooth to write.
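The preprocessing steps could be sketched as follows. Several details are assumptions on my part, since the recap does not pin them down: the column names, that "order of training set categories" means order of first appearance in the training set, and that the tip is z-scored with the training set's mean and population standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
train = pd.DataFrame(
    {
        "age": [30.0, np.nan, 41.0, 36.0],
        "car_model": ["sedan", "suv", "sedan", "suv"],
        "second_language": ["none", "spanish", "none", "none"],
        "tip": [1.0, 2.0, 3.0, 2.0],
        "driver_level": ["A", "B", "A", "B"],
    }
)
test = pd.DataFrame(
    {
        "age": [np.nan, 25.0],
        "car_model": ["suv", "sedan"],
        "second_language": ["spanish", "none"],
        "tip": [2.0, 4.0],
        "driver_level": ["B", "A"],
    }
)

# 1. Rounded mean age, computed from the training set only.
mean_age = int(round(train["age"].mean()))

# 2. Combine both sets, keeping a flag so they can be split again later.
combined = pd.concat(
    [train.assign(_is_train=True), test.assign(_is_train=False)],
    ignore_index=True,
)
combined["age"] = combined["age"].fillna(mean_age)

# 3. Integer codes in order of first appearance in the training set
#    (assumption). Categories unseen in training would become NaN.
for col in ["car_model", "second_language"]:
    mapping = {cat: code for code, cat in enumerate(pd.unique(train[col]))}
    combined[col] = combined[col].map(mapping)

# 4. Z-score the tip with training statistics (assumption), then round.
tip_mean = train["tip"].mean()
tip_std = train["tip"].std(ddof=0)
combined["tip"] = ((combined["tip"] - tip_mean) / tip_std).round(5)

# 5. Driver level A / B -> 0 / 1.
combined["driver_level"] = combined["driver_level"].map({"A": 0, "B": 1})

# 6. Split back; each frame would then be written with to_csv(index=False).
train_out = combined[combined["_is_train"]].drop(columns="_is_train")
test_out = (
    combined[~combined["_is_train"]]
    .drop(columns="_is_train")
    .reset_index(drop=True)
)
```

Computing the fill value and the scaling statistics from the training rows only keeps test information out of the fitted parameters, which is the usual convention even when the two sets are processed together.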
Q3 Data Integration
The third question is a little more complicated and is mainly about joining multiple tables.
The data comes from three tables:
- Driver
- Vehicle
- Trip
Several new fields need to be constructed.
The first is the number of days since the vehicle's last annual inspection
Using the date of the vehicle's last annual inspection, calculate how many days have passed between that date and the current time.
The second is the driver's years of driving experience
This uses a simple formula:
Years of experience = 2023 - Year of driving start
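A small sketch of these two derived fields, assuming hypothetical column names (`last_inspection`, `driving_start_year`). A fixed reference date stands in for "the current time" so that the example is reproducible; in the real task that would be something like `pd.Timestamp.today()`.

```python
import pandas as pd

# Hypothetical vehicle and driver tables.
vehicles = pd.DataFrame(
    {"vehicle_id": [10, 11], "last_inspection": ["2023-01-01", "2022-06-15"]}
)
drivers = pd.DataFrame(
    {"driver_id": [1, 2], "driving_start_year": [2015, 2020]}
)

# Assumed reference date, used here instead of the current time.
as_of = pd.Timestamp("2023-03-01")

# Parse the dates, subtract, and keep the whole-day component.
vehicles["last_inspection"] = pd.to_datetime(vehicles["last_inspection"])
vehicles["days_since_inspection"] = (as_of - vehicles["last_inspection"]).dt.days

# Years of experience = 2023 - year of driving start.
drivers["years_experience"] = 2023 - drivers["driving_start_year"]
```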
The third is the driver's total number of likes
The various like counts in the trip table need to be rolled up to the driver level.
The basic approach is:
- Aggregate trip data by driver
- Sum the fields related to likes
- Then merge back to the driver table
If some drivers have no trip records, the number of likes needs to be filled in with 0.
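The aggregate, sum, merge, and fill sequence could look like this. The like columns (`likes_clean_car`, `likes_friendly`) are made-up names for illustration; the real trip table may have different or additional like fields.

```python
import pandas as pd

# Hypothetical tables: driver 3 has no trips at all.
drivers = pd.DataFrame({"driver_id": [1, 2, 3]})
trips = pd.DataFrame(
    {
        "driver_id": [1, 1, 2],
        "likes_clean_car": [1, 0, 2],
        "likes_friendly": [2, 1, 0],
    }
)
like_cols = ["likes_clean_car", "likes_friendly"]

# Sum each like column per driver, then add the columns together.
likes_per_driver = (
    trips.groupby("driver_id")[like_cols]
    .sum()
    .sum(axis=1)
    .rename("total_likes")
    .reset_index()
)

# A left merge keeps drivers without trips; their total becomes NaN,
# which is then filled with 0.
drivers = drivers.merge(likes_per_driver, on="driver_id", how="left")
drivers["total_likes"] = drivers["total_likes"].fillna(0).astype(int)
```

The left join is what makes the fill step necessary: an inner join would silently drop the drivers with no trip records instead.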
Finally, select the required columns and output the summary CSV.
Q4 Random Forest Modeling + Threshold Adjustment
The last question is a simple classification model.
The process is relatively standard:
1 Read training data
2 Separate features and labels
Then handle missing values:
- Numeric features → fill with the median
- Categorical features → fill with the mode
Then encode the categorical variables.
The model uses Random Forest, and sets a higher weight for category B because the question pays more attention to this category.
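One way to wire imputation, encoding, and the weighted forest together is a scikit-learn pipeline, sketched below. The toy columns, the choice of one-hot encoding, and the weight of 3 for class B are assumptions for illustration; the recap only says that B gets a higher weight.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with missing values in both column types.
train = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0, 35.0, 52.0, 29.0],
        "car_model": ["sedan", "suv", np.nan, "sedan", "suv", "sedan"],
        "label": ["A", "B", "A", "A", "B", "A"],
    }
)
X = train.drop(columns="label")
y = train["label"]

numeric_cols = ["age"]
categorical_cols = ["car_model"]

preprocess = ColumnTransformer(
    [
        # Numeric features: median fill.
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        # Categorical features: mode fill, then one-hot encode (assumed).
        (
            "cat",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="most_frequent")),
                    ("encode", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            categorical_cols,
        ),
    ]
)

model = Pipeline(
    [
        ("preprocess", preprocess),
        (
            "forest",
            RandomForestClassifier(
                n_estimators=100,
                class_weight={"A": 1, "B": 3},  # assumed weights
                random_state=0,
            ),
        ),
    ]
)
model.fit(X, y)

# Probability of class B for each row, looked up by class position.
classes = list(model.named_steps["forest"].classes_)
proba_b = model.predict_proba(X)[:, classes.index("B")]
```

Keeping the imputers inside the pipeline means the median and mode are learned from the training data and reused unchanged on validation and test data.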
There is one more critical step after training:
Try different decision thresholds on the validation set.
The goals are:
- Meet the minimum requirements for precision
- At the same time, try to increase recall
After finding a suitable threshold, use this threshold to make predictions on the test set, and finally output the result file.
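The threshold search can be sketched with plain NumPy. The validation labels, the predicted probabilities, and the 0.75 precision floor below are made-up values; the OA states its own minimum precision.

```python
import numpy as np

# Hypothetical validation labels (1 = class B) and predicted P(class B).
y_val = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_val = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])
min_precision = 0.75  # assumed requirement


def pick_threshold(y_true, scores, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision meets the floor, or None if none qualifies.
    Assumes y_true contains at least one positive."""
    best = None
    # Every distinct score is a candidate cut-off, so each candidate
    # predicts at least one positive and precision is always defined.
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (float(t), float(precision), float(recall))
    return best


threshold, precision, recall = pick_threshold(y_val, p_val, min_precision)

# Apply the chosen threshold to (hypothetical) test-set probabilities.
p_test = np.array([0.65, 0.2])
test_pred = np.where(p_test >= threshold, "B", "A")
```

If no threshold meets the precision floor, `pick_threshold` returns `None`, so real code should handle that case before unpacking the result.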