Many programming languages have map and reduce functions. In this project you’ll learn why they’re so useful, and you’ll do some katas to get your head around them.
The first thing to know is that map-reduce is enormously useful when it comes to processing big data. If you wanted to chug through 1TB of data and distribute that processing across a fleet of computers, you’d need a solid understanding of map-reduce.
MapReduce became popular because of a tool called Hadoop. You can read about how Hadoop does it here: https://www.guru99.com/introduction-to-mapreduce.html
Please use git feature branching: use a separate branch for each task in this project.
Write a well-architected command-line utility that:
Please pretend that you are running something like Hadoop: imagine your data gets split across 50 computers, your map and reduce functions have to run on those machines, and the results of those operations are combined on one machine right at the end.
That means you can’t have a mapper or reducer function that has knowledge of anything except its direct inputs and outputs.
For example, you MUST NOT do this:
// pseudocode — BAD: the mapper mutates word_counts, which lives outside it
words = text.split()
word_counts = {}
words.map((word) => {
  if (word_counts[word]) {
    word_counts[word] = word_counts[word] + 1
  } else {
    word_counts[word] = 1
  }
})
This is wrong because the mapper function depends on state (word_counts) that lives outside the mapper function.
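For contrast, here is a minimal sketch of the stateless version, in the same pseudocode style. The little run driver is a stand-in for what a framework like Hadoop does between the two steps; none of these names are prescribed by the task:

// pseudocode — OK: mapper and reducer know only their direct inputs and outputs
mapper = (text) => text.split(/\s+/).filter(Boolean).map((word) => [word, 1])

reducer = (word, counts) => [word, counts.reduce((sum, n) => sum + n, 0)]

// tiny local stand-in for the framework: map each chunk independently
// (as if on 50 machines), group the emitted pairs by word, then reduce per word
run = (chunks) => {
  groups = {}
  chunks.flatMap(mapper).forEach(([word, one]) => {
    groups[word] = (groups[word] || []).concat(one)
  })
  return Object.entries(groups).map(([word, counts]) => reducer(word, counts))
}

The important property: mapper never reads or writes anything outside its own arguments and return value, so the 50 chunks could be mapped in any order, on any machine.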
Build on the work you did in the previous task to make a command-line utility that:
For example, this page has the word hippopotamus. It is incredibly unlikely that any other syllabus page does.
So if this page’s URL is an input for your script, then part of your output would be:
hippopotamus:
- {this url}
{second most common word}:
- {some other url}
- {maybe another one}
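As a rough sketch of how the map and reduce steps could look for this task (these are assumptions, not requirements: each mapper receives one URL together with that page’s fetched text, and words are lowercased before matching):

// pseudocode — one (url, pageText) pair per mapper call
mapper = (url, pageText) =>
  pageText.split(/\s+/).filter(Boolean).map((word) => [word.toLowerCase(), url])

// the reducer sees one word plus every url that emitted it, duplicates possible
reducer = (word, urls) => [word, [...new Set(urls)]]

Fetching each page is best done before the map step (or left to the imagined framework), so that the mapper stays a pure function of its inputs.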