· 8 min read
How to Build a Data Warehouse for Games from Scratch
Ramón Lastres
CTO at GameAnalytics
Over our last couple of blogs around data warehouses, we’ve explained how they let you analyze data from across your portfolio and look at what insights you can gather from them.
Now, we’ll dive into how to build a data warehouse. What steps do you need to take and what resources will you need? To figure this out, we’ve rounded up the costs, steps, and tools we think you’ll need to get started. Please note, that we haven’t included the cost of running an engineering department (which you’ll need), which can end up being a lot of $$$.
What do I need to get started?
Before you start, you’ll need to ensure you have the right people. You’ll likely need a software or data engineer, and perhaps an architect or DevOps engineer. You’ll also need to budget for tools like data storage, servers, and database licenses. You might decide to host these on-premise or in the cloud. It’s up to you.
Alternatively, you can use our Player Warehouse and let us take the load off your shoulders.
Six steps to your own data warehouse
If you’re looking to create your own data warehouse, there are six steps you’ll need to follow. Make sure you give yourself ample time to finish each step – it’s hard to predict exactly how long each step will take. Some might take three weeks, others three months. Chat with your developer to check.
When pricing up this entire process, it’s a bit difficult to give accurate estimates. As you can imagine, many moving parts are at play. We’ve put some rough costs in this article, but don’t take it for the gospel. You can consider it a rough number that a studio of a semi-large game could expect to pay.
Step one: Add your tracking code
Average cost: $0.
Average time: Anywhere between a week and 6 months.
Typical tools: GameAnalytics core tool.
In this stage, you’re adding the logic to track events in your game that you’ll eventually send to the servers. There’s a lot you’ll want to capture. What are players doing? How long do they stay logged in? When do they make a purchase? What level were they on when they signed out?
Thankfully, you don’t need to plan all that logic yourself. Every analytics platform has its own API you can plug into. Or you can just use our core tool. We’ve already defined all these events and set up the tracking, so you just need to use our API and SDK to grab the information.
Using our core tool will likely take you a week to set up and learn how to grab the relevant information. But if you’re looking to create your own analytics, this could take a long time. (It’s taken us years to get our core tool to where it is now.)
Step two: Collect the data
Average cost: $2,000 per month.
Average time: One month.
Typical tools: AWS S3, REST API, AWS Kinesis Data Firehose, AWS API Gateway.
Once you’ve got the code in place to collect your games’ data, you need somewhere to send it all and store it. Remember, you’ll need to encrypt the data you send, so any service you use needs to work with TLS – the standard security protocol when sending data over the internet.
You’ll need two services
Sometimes these come bundled together, and you’ll have to get them separately. The first is an HTTP service at the front, like AWS API gateway. This handles how a device connects to your storage.
The second is your storage, like AWS S3. This is what holds the actual data.
Combining AWS API gateway with S3 is a good mix, as it’ll mean you can use TLS and easily scale it up if you get more traffic than expected.
Consider the cost of the data, storage, and processing power
Whatever service you choose, you need to account for the data transfer costs – specifically your ‘egress costs’. You’ll also need the computer power necessary to handle that much-encrypted traffic, the connections themselves, and parsing the events. Calculating this cost can be a bit tricky. It depends on the volume of your data. You can calculate it here if you’re using AWS Stack (we also wrote a blog on how we reduced the cost of HTTP(S) APIs on AWS. Check it out here).
What other options are there?
Many cloud providers offer managed services, taking the load off your hands. For example, AWS Kinesis Date Firehose or API Gateway. These take a lot of the strain off you, but are quite expensive. They’re also not designed specifically for games, so it can be tricky to set them up properly.
Step three: Transform the data
Average cost: $1,000 per month.
Average time: Depends on data size and sources.
Typical tools: AWS Glue, Hadoop (EMR), Kafka, Kinesis.
Now that you’ve collected all that data, you need to convert it all into a standardized format. The idea here is to process the data so that you can add it to the warehouse. This stage is crucial if you want to bring together multiple data sets from different sources, otherwise, you won’t be able to search through the data properly or find links between those data sets.
Depending on how you’ve set up your original databases and how many different sources you’re gathering from, this could be relatively simple or quite complex. Many sources with different rules and data sets will be difficult to standardize.
Usually, you’ll be processing in batches, using AWS Glue or Spark on Hadoop (EMR). But in some cases, you might want real-time data. In those cases, you’ll want a streaming service like Kafka or Kinesis.
A word on pricing
These steps are usually quite pricey. It really depends on the scale of your solution. Post-processing data alone (adding the cost of the transform) could get up to $12K, and then if you add the cost of the player ETL, this can be around $5K per month.
Depending on how much data you’re dealing with (so say billions of events), it can cost you tens of thousands of dollars each month.
Step four: Load data to the warehouse
Average cost: Anywhere between $100 and $2,000 a month.
Average time: Ongoing. Usually once a day.
Typical tools: BigQuery Data Transfer, AWS Glue, Apache Druid, or Click house.
Once you’ve prepared all your data in a standard format, it’s time to send it to the warehouse.
Rather than sending it all at once, it’s best to send a batch once a day. You’ll also want to manage how much you send based on your typical queries and the data lifecycle. This also means you can handle the costs and not spend too much at once.
The most popular warehouses are Snowflake, BigQuery or AWS Redshift. And if you need real-time data, you’ll want to look into Apache Druid, Apache Pinot, or Clickhouse.
Step five: Monitor and troubleshoot
Average cost: Free (or not too pricey).
Average time: Around a week.
Typical tools: AWS, Cloudwatch, Datadog, Grafana, Pingdom, or Pagerduty.
Once everything is set up, and the devices send data, it’s time to ensure it stays online. Any downtime is going to lose you data, which could be crucial.
You’ll need someone on call to deal with problems, day and night. Usually, a small team of two or three should be able to handle this, along with their other responsibilities. But bear in mind that they’ll need to check the status regularly and have the skills to fix any problems on their own.
Step six: Analyze and visualize
Average cost: Free (although can be thousands, depending on what you need).
Average time: At least a couple of weeks.
Typical tools: Tableau, Superset, Holistics, Looker, Google Studio, or AWS QuickSight.
Finally, you need to make that data useful. If you’ve got someone who knows SQL, they can run queries. But this often needs specialist knowledge. Usually, studios go for a visualization tool or framework, like Tableau, Superset, Holistics, Looker, or AWS QuickSight.
Our Player Warehouse is ready to go
Instead of all that faff, you could just use our Player Warehouse. We’ve designed it specifically for game developers. And you can get started in minutes, not months. Let us handle keeping it up and running. And save yourself all the hosting and processing costs.
That way you can focus on the last step: analyzing the data. So get started and make the most of your games.