Table of Contents
Overview
This month, a B2B company is migrating its CSV files onto a database. These CSV files contain years of oddly formatted data that was not standardized. The company understands that doing such a task themselves might impose significant downtime and may not be worth the time taken to do it themselves. They have hired you, along with a team of other developers, to create a stop-gap solution for the time being by using a hybrid approach.
They plan to expose an API that leverages the CSV files as a database to ensure server uptime remains unaffected during extensive data transfers. Utilizing CSV files conventionally proves highly inefficient since it involves disk-based read/write operations, which are considerably slower. (For further insights, consider examining Amdahl’s Law, which essentially underscores that I/O-bound tasks are inherently sequential and challenging to accelerate without hardware modifications.) To address this inefficiency, the plan is to implement a dictionary as an associative cache for vital CSV data, particularly data related to ongoing projects. This dictionary will be stored in memory, thereby enhancing I/O performance for many operations.
Understanding the Problem
Let’s take a look at a table visualization of what the data in one of the CSV’s looks like:
id | date | manager | tasks_csv_path | expected_cost | expected_payment | ongoing | … |
---|---|---|---|---|---|---|---|
43 | 2008-12-13 | Lempel | website.com/assets/csvs/2008-12-13/tasks-43.csv | $32039.93 | $40940.00 | false | |
78 | 2009-02-17 | Welch | website.com/assets/csvs/2008-02-17/tasks-78.csv | $38731.04 | $41500.00 | false | |
126 | 2023-09-28 | Ziv | website.com/assets/csvs/2023-09-28/tasks-126.csv | $63040.49 | $79000.00 | true |
Our job is to essentially abstract away these details into three functions:
- parse_csv
- Should parse the CSV file into a dictionary object (can use the ‘csv’ Python library)
- get_project
- Check if the project is already in the cache, if it is then return it
- If not in the cache, get details from the CSV
- Add or remove from the cache with
cache_fields
- cache_fields
- Here, a choice needs to be made on what method to implement for adding and removing elements from the “cache”. A common policy for this is least-recently-used (or LRU). But, in this case, we expect that these projects won’t be accessed frequently, so the spatial and temporal locality from LRU may not be very useful. Maybe it’s smarter to use a mostly static cache with a priority for ongoing projects.
Optimization
The optimization part of this sample problem is quite basic. It’s essentially just a process of trying out different cache policies and experimenting with ways to read the CSV faster to get the maximum possible speedup over a naive approach (e.g. cacheless approach)