Tutorial: Replace Excel's VLOOKUP with an R Function
What is a VLOOKUP?
Before we get into it, let's look at how Excel's VLOOKUP function works so it is clear what we're reproducing in R. VLOOKUP is used to copy data from one dataset to another based on matching values. "Dataset" in this case can refer to a column, table, sheet, etc. For example, you may have one sheet that has contact info for your customers. From your email program, you have a list of email addresses with an action taken on an email campaign. Now you want to combine your contact data with your email open/click data. With VLOOKUP, you can match by the "email" column in each dataset and copy over the open/click data to the contact data.
Example:
So why not just use a VLOOKUP?
Great question. If you are only doing VLOOKUPs here and there, and on relatively small datasets, it may not make sense to invest your time in writing a function in R. However, if you find yourself using VLOOKUPs routinely, or needing several in a single project, you can save yourself a lot of time in the long term by investing a little time now, developing an R function.
VLOOKUPs also have some limitations that can be overcome with R. First, VLOOKUPs only read left to right, so you have to make sure your lookup value (what you are matching on) is to the left of the data you wish to copy. Second, a VLOOKUP formula only handles one column at a time. If you need to copy multiple columns, it takes multiple VLOOKUP formulas. VLOOKUPs can also cause performance issues. Several VLOOKUPs in a large dataset can slow things down or crash Excel all together. In addition to being more efficient, an R function can overcome all of these issues.
Why R?
Well... this is my tutorial, and I chose R. The argument here isn't really pro-R specifically, it's an argument for an automated function vs manual work. So, if Python is your thing, use Python. Use whatever language you want. That said, if you work with data often and you're looking for a language to learn, R is great. It's easy to learn, free, and pretty powerful. As you will see here, even if you are new to R, this tutorial will be pretty easy to follow and implement.
The R function
The R function explained
LINES 1-8: The top of the file is a list of comments describing the function's purpose and the arguments that the function uses.
LINE 10: The first thing we do is define the function . I called this function vlookup. The vlookup function takes 6 arguments that are described in lines 3-8. Ultimately, when this function is called, you will enter your info (file names, column names, etc.) in place of these arguments. We'll discuss calling the function in more detail later.
LINES 13-14: Next we read in the two files that are defined in the first two function arguments.
LINE 17: Here we define a variable called "join" and set it equal to the value of the "jointype" argument. Unlike a VLOOKUP, which is limited to LEFT JOINS (returns all rows from dataset 1, matching rows from dataset 2), the jointype argument allows this function to return LEFT, RIGHT, INNER, and OUTER JOINS. Each join type is described in the function comments.
LINES 20-35: Here we use an if, else if, else statement to merge the files based on the arguments defined in the function. Each argument is incorporated into the merge function. Within the merge function, "all" is handled differently based on the "join" variable. Each if is described with a comment to help you understand how each join type functions. Note: The default for the "jointype" argument is "", so if you leave it blank when you call the function, you'll get an INNER JOIN. This is usually what you'll want.
LINE 38: The last thing we do is return the results. I chose to write the data to a CSV file, the name of which is defined in the function argument "filename". Note: The filename argument has a default of "MatchedFile.csv" so if you leave it blank, your file will be called "MatchedFile.csv". While I chose to export a CSV file, you could handle the results of the merge any number of ways. Line 38 could be used to set a variable equal to the results so that you can do more work with the dataset. What happens with the merged data here is up to you.
Calling your function: Once your function is done, you can call it anytime you need to match and merge two files. The function call will look like this (link to code):
As you can imagine, calling the vlookup function is much faster than doing a VLOOKUP in Excel. As the graphs at the top of this post show, you will be more productive out of the gate using Excel, but if you spend time up front automating the process in R, you'll be far more productive long term.