BCG

Two days ago I walked a student through the BCGX Data Scientist OA. To be honest, once you have done it a few times this OA feels like "déjà vu". The question types are basically stable: over the past two years, most sittings have drawn from the same two sets of questions.

We have been through it many times, and the structure of the questions has hardly changed. As long as you have worked out your approach or drafted a code skeleton beforehand, it goes quite quickly on the day. This sitting had 4 questions in total, covering Python data processing plus simple modeling. The overall difficulty is not high; it is more like a small data pipeline plus an ML task.

Let’s briefly talk about the core content of these four questions.

Q1 Data statistics questions

The first question is a typical data statistics question. Given a driver table and multiple trip data files, we need to calculate several metrics and output them to CSV.

There are three main tasks:

Calculate the average driver rating from the driver table
Calculate the proportion of drivers who speak a second language
Combine all trip files and calculate the proportion of successful trips

The only thing to watch here is that the trip data is spread across multiple files and has to be concatenated before any statistics are computed. The rest is routine pandas work, such as:

Reading data
Conditional filtering
Computing means
Computing proportions

Finally, organize the three indicators into a small table and export it to CSV.
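
The Q1 steps above can be sketched in pandas roughly as follows. This is only an illustration: the column names (`rating`, `second_language`, `status`), the `"success"` value, and the metric names are assumptions, and small in-memory frames stand in for the real driver and trip files.

```python
import pandas as pd

# Hypothetical driver table; column names are assumptions, not the OA's schema.
drivers = pd.DataFrame({
    "driver_id": [1, 2, 3, 4],
    "rating": [4.5, 4.0, 5.0, 3.5],
    "second_language": ["Spanish", None, "French", None],
})

# Hypothetical trip files, shown here as already-loaded frames.
# With real files each part would come from pd.read_csv on one file.
trip_parts = [
    pd.DataFrame({"trip_id": [1, 2], "status": ["success", "cancelled"]}),
    pd.DataFrame({"trip_id": [3, 4], "status": ["success", "success"]}),
]

# Combine all trip files before computing anything on them.
trips = pd.concat(trip_parts, ignore_index=True)

avg_rating = drivers["rating"].mean()
# The mean of a boolean Series is the share of True values.
second_lang_share = drivers["second_language"].notna().mean()
success_share = (trips["status"] == "success").mean()

# Organize the three metrics into a small table.
summary = pd.DataFrame({
    "metric": ["avg_rating", "second_language_share", "trip_success_share"],
    "value": [avg_rating, second_lang_share, success_share],
})

# to_csv() without a path returns the CSV text; pass a path to write a file.
csv_text = summary.to_csv(index=False)
```

How "speaks a second language" and "successful trip" are actually encoded depends on the data provided, so the two boolean conditions are the parts most likely to need changing.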

Q2 Data preprocessing

The second question is more like the data preprocessing process in machine learning projects.

The main steps are roughly these:

Fill in missing ages with the mean age of the training set (rounded)
Combine the training set and test set so they can be processed together
Numerically encode car model and second language, following the category order of the training set
Standardize the tip amount and round it to five decimal places
Map driver level A / B to 0 / 1

After processing, save the processed training set and test set separately.
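
A minimal sketch of these preprocessing steps, under several assumptions: all column names are made up, "category order of the training set" is read as order of first appearance, and "standardize" is read as a z-score using the training set's mean and population standard deviation. The actual question statement may define any of these differently.

```python
import pandas as pd

# Hypothetical train/test frames; all column names are assumptions.
train = pd.DataFrame({
    "age": [31.0, None, 42.0, 36.0],
    "car_model": ["sedan", "suv", "sedan", "van"],
    "second_language": ["none", "spanish", "none", "french"],
    "tip": [1.0, 2.0, 3.0, 2.0],
    "driver_level": ["A", "B", "A", "B"],
})
test = pd.DataFrame({
    "age": [None, 25.0],
    "car_model": ["suv", "van"],
    "second_language": ["spanish", "none"],
    "tip": [2.0, 4.0],
    "driver_level": ["B", "A"],
})

# Statistics come from the training set only.
age_fill = round(train["age"].mean())
tip_mean = train["tip"].mean()
tip_std = train["tip"].std(ddof=0)  # population std; an assumption

# Tag the rows so the combined frame can be split again afterwards.
combined = pd.concat(
    [train.assign(_split="train"), test.assign(_split="test")],
    ignore_index=True,
)

combined["age"] = combined["age"].fillna(age_fill)

# Encode categories by order of first appearance in the training set.
for col in ["car_model", "second_language"]:
    categories = list(train[col].unique())
    mapping = {cat: code for code, cat in enumerate(categories)}
    combined[col] = combined[col].map(mapping)

combined["tip"] = ((combined["tip"] - tip_mean) / tip_std).round(5)
combined["driver_level"] = combined["driver_level"].map({"A": 0, "B": 1})

# Split back into the two sets; each can then be saved with to_csv.
train_out = combined[combined["_split"] == "train"].drop(columns="_split")
test_out = combined[combined["_split"] == "test"].drop(columns="_split")
```

A category that appears only in the test set would map to NaN here, so it is worth checking for that case before saving.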

This question is essentially a simplified feature engineering and preprocessing pipeline. If you use pandas or sklearn regularly, it is straightforward to write.

Q3 Data Integration

The third question is a little more complicated and mainly involves relating multiple tables.

The data comes from three tables:

Driver
Vehicle
Trip

Several new fields need to be constructed.

The first is the number of days since the vehicle's last annual inspection
Calculate how many days have passed between the vehicle's last annual inspection and the current time.

The second is the driver’s driving experience
Use a simple formula to calculate:

Years of experience = 2023 - Year of driving start
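
The two derived fields so far can be sketched like this. The column names `last_inspection` and `driving_start_year` are assumptions, and a fixed reference date is used only to keep the example reproducible.

```python
import pandas as pd

# Hypothetical vehicle and driver tables; column names are assumptions.
vehicles = pd.DataFrame({
    "vehicle_id": [10, 11],
    "last_inspection": ["2023-01-01", "2022-06-15"],
})
drivers = pd.DataFrame({
    "driver_id": [1, 2],
    "driving_start_year": [2015, 2020],
})

# A fixed date stands in for the current time in this sketch;
# pd.Timestamp.now() would be used in a live run.
reference_date = pd.Timestamp("2023-07-01")

# Parse the date strings, subtract, and keep the whole-day part.
inspection = pd.to_datetime(vehicles["last_inspection"])
vehicles["days_since_inspection"] = (reference_date - inspection).dt.days

# Years of experience = 2023 - year of driving start.
drivers["years_experience"] = 2023 - drivers["driving_start_year"]
```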

The third is the driver's total number of likes
The various like counts in the trip table need to be aggregated up to the driver level.

The basic approach is:

Aggregate trip data by driver
Sum the fields related to likes
Then merge back to the driver table

If some drivers have no trip records, the number of likes needs to be filled in with 0.
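
A minimal sketch of the aggregate, sum, merge, and fill sequence. The like columns (`likes_clean_car`, `likes_friendly`) are hypothetical; the real trip table will name its own.

```python
import pandas as pd

# Hypothetical tables; driver 3 has no trips on purpose.
drivers = pd.DataFrame({"driver_id": [1, 2, 3]})
trips = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "likes_clean_car": [1, 0, 2],
    "likes_friendly": [2, 1, 0],
})

like_cols = ["likes_clean_car", "likes_friendly"]

# Sum each like column per driver, then add the columns into one total.
likes = (
    trips.groupby("driver_id")[like_cols].sum().sum(axis=1)
    .rename("total_likes")
    .reset_index()
)

# A left merge keeps drivers without trips; their total is NaN until filled.
result = drivers.merge(likes, on="driver_id", how="left")
result["total_likes"] = result["total_likes"].fillna(0).astype(int)
```

The left merge is the important choice: an inner merge would silently drop the drivers who have no trip records.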

Finally, sort out the required columns and output the summary CSV.

Q4 Random Forest Modeling + Threshold Adjustment

The last question is a simple classification model.

The process is relatively standard:

1 Read training data
2 Separate features and labels

Then do missing value processing:

Numeric features → median fill
Categorical features → mode fill

Then encode the categorical variables.

The model is a Random Forest, with a higher class weight set for class B because the question cares more about this class.
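
One way to wire these steps together is a scikit-learn pipeline. This is a sketch under assumptions: the feature names and data are invented, one-hot encoding is just one possible encoding, and the weight of 3 for class B is an arbitrary illustrative value.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; column names and values are assumptions.
train = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 33.0, 52.0, 29.0, 47.0, 38.0],
    "trips": [120.0, 300.0, 80.0, np.nan, 410.0, 95.0, 350.0, 220.0],
    "car_model": pd.Series(
        ["sedan", "suv", "sedan", np.nan, "suv", "sedan", "suv", "sedan"],
        dtype=object,
    ),
    "level": ["A", "B", "A", "A", "B", "A", "B", "A"],
})

# Separate features and labels.
X = train.drop(columns="level")
y = train["level"]

numeric_cols = ["age", "trips"]
categorical_cols = ["car_model"]

preprocess = ColumnTransformer([
    # Numeric features: fill missing values with the median.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical features: fill with the mode, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(
        n_estimators=100,
        class_weight={"A": 1, "B": 3},  # the value 3 is an assumption
        random_state=0,
    )),
])
model.fit(X, y)
predictions = model.predict(X)
```

Keeping imputation and encoding inside the pipeline means the same fitted transformations are reused on validation and test data.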

There is another critical step after training is completed:
Different thresholds need to be tried on the validation set.

The goals are:

Meet the minimum precision requirement
Subject to that, make recall as high as possible

After finding a suitable threshold, use this threshold to make predictions on the test set, and finally output the result file.
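
The threshold search can be sketched as a plain loop over candidate thresholds. The function name `pick_threshold`, the validation data, and the precision floor of 0.7 are all hypothetical; the real floor comes from the question statement.

```python
import numpy as np

def pick_threshold(y_true, proba_pos, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None."""
    y_true = np.asarray(y_true, dtype=bool)
    proba_pos = np.asarray(proba_pos, dtype=float)
    best = None
    # Using the observed scores as candidates guarantees that every
    # threshold predicts at least one positive.
    for threshold in np.unique(proba_pos):
        predicted = proba_pos >= threshold
        true_pos = np.sum(predicted & y_true)
        precision = true_pos / predicted.sum()
        recall = true_pos / y_true.sum()
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (float(threshold), float(precision), float(recall))
    return best

# Hypothetical validation labels (True = class B) and predicted P(class B).
y_val = [True, False, True, True, False, False]
proba_b = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

result = pick_threshold(y_val, proba_b, min_precision=0.7)
```

With a fitted scikit-learn classifier, the class B probabilities would come from the `predict_proba` column whose position matches class B in `classes_`, and test predictions would then be `proba >= threshold`. The sketch assumes the validation set contains at least one class B example.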

If you are also preparing for OA

We have taken many students through this kind of OA, including BCG and DS roles at other consulting companies, and the patterns are similar. If you have a similar OA coming up and feel a little unsure, or are worried about suddenly getting stuck during the exam, you can learn about our OA real-time auxiliary service in advance. During the exam, someone will help you check it together. Many students completed the entire OA smoothly in this way.

Jory Wang Senior Software Engineer at Amazon

Senior Amazon Engineer with rich practical experience in the development of core systems for infrastructure, specializing in system scalability, reliability, and cost optimization. Currently focused on mentoring FAANG SDE interviews, helping over 30 candidates secure L5/L6 offers within a year.
