Recently, I went on holidays to the Vosges mountains in northeastern France. While one or two days were definitely too rainy to take electronics outside, I was able to take some pics with my Micro-Four-Thirds (MFT) camera of the beautiful autumn landscape, of our family dog (Team #rdogs!) and of the many, many fly agarics.
Back home and with a free weekend all to myself, I ventured to sort the photos and send the best ones to my family + friends who were with me on the trip. This is always my least favorite part because I take a lot of pictures and a lot of them are…well…not worth the time of looking at.
So I opened the photo viewer on my Linux laptop, went through the photos and deleted the ones I didn’t like. “Done”, you’d think. Well, no. Why? Because, some months ago, I decided I really needed to have RAW files - just in case I’d ever want to seriously edit something (spoiler: I’m too lazy for that). Soo, whenever I push the shutter button nowadays, two files with the same name are stored on my SD card: a normal JPG file and a RAW file with the RW2 extension. So, for example, P1120006.JPG and P1120006.RW2.
However, the Linux photo viewer only shows me the JPG files. So after an hour of deleting JPGs, I still needed to delete the corresponding RW2 files of the JPGs I had deleted. And my dislike for doing stuff in the explorer / finder was big enough that I decided to automate this. Because the offending files are already deleted, I set up a little test case for this post, but I’ll include some screenshots that will show how much time - and nerves - I saved with this little R exercise.
Step 1: Get the data
First up is actually getting the file paths. For this, I use the good old list.files command, which gives you all files in a given folder. I get both the plain file name and the full path to each file.[1]
# delete RAW files where the jpg is deleted
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# FOLDER <- "/home/frie/Pictures/2019/2019-10_vogesen/"
FOLDER <- "/tmp/pics"
full_paths <- list.files(FOLDER, full.names = TRUE)
file_names <- list.files(FOLDER)
df <- tibble::tibble(full_path = full_paths, file_name = file_names)
df

## # A tibble: 8 x 2
##   full_path              file_name
##   <chr>                  <chr>
## 1 /tmp/pics/P1120001.RW2 P1120001.RW2
## 2 /tmp/pics/P1120006.JPG P1120006.JPG
## 3 /tmp/pics/P1120006.RW2 P1120006.RW2
## 4 /tmp/pics/P1120008.RW2 P1120008.RW2
## 5 /tmp/pics/P1120009.JPG P1120009.JPG
## 6 /tmp/pics/P1120009.RW2 P1120009.RW2
## 7 /tmp/pics/P1120010.JPG P1120010.JPG
## 8 /tmp/pics/P1120010.RW2 P1120010.RW2
There are 8 files in the folder. By manually looking at the data, I can easily see that I want to delete P1120001.RW2 and P1120008.RW2.
In the real case, there were 942 files 😱. No way to easily see that at a glance!
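The two list.files calls in Step 1 can be collapsed into one, because base R's basename() strips the directory part from a path. What follows is only a minimal sketch under assumptions: demo_dir and df_demo are made-up names, and the folder is a throwaway temp directory with empty files rather than the real picture folder.

```r
# Hypothetical demo folder with a few empty files, standing in for /tmp/pics
demo_dir <- tempfile("pics_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("P1120001.RW2", "P1120006.JPG", "P1120006.RW2")))

# List the directory once, then derive the bare file names from the full paths
full_paths <- list.files(demo_dir, full.names = TRUE)
file_names <- basename(full_paths)

df_demo <- data.frame(
  full_path = full_paths,
  file_name = file_names,
  stringsAsFactors = FALSE
)
df_demo
```

Listing the directory only once also means both columns are guaranteed to describe the same set of files, even if something changes in the folder in between.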
Step 2: Determine which files need to be deleted
The RW2 and JPG versions of a photo have the same file name, except for the extension. I first extract this “common” element of the file name using tidyr::separate, which splits a character vector at a certain pattern (the sep argument) and directly puts the split pieces into new columns (hard to explain 😄, just see the result and compare with before!). This is honestly one of my favorite functions ever because it’s such a common task that would otherwise be really annoying.[2]
df <- df %>%
  tidyr::separate(file_name, into = c("file_name_without_ext", "ext"), sep = "\\.")
df

## # A tibble: 8 x 3
##   full_path              file_name_without_ext ext
##   <chr>                  <chr>                 <chr>
## 1 /tmp/pics/P1120001.RW2 P1120001              RW2
## 2 /tmp/pics/P1120006.JPG P1120006              JPG
## 3 /tmp/pics/P1120006.RW2 P1120006              RW2
## 4 /tmp/pics/P1120008.RW2 P1120008              RW2
## 5 /tmp/pics/P1120009.JPG P1120009              JPG
## 6 /tmp/pics/P1120009.RW2 P1120009              RW2
## 7 /tmp/pics/P1120010.JPG P1120010              JPG
## 8 /tmp/pics/P1120010.RW2 P1120010              RW2
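Splitting at every dot works here because the camera's file names contain exactly one dot. If a name had more than one, it would no longer split cleanly into two pieces. A possible alternative, sketched here as an assumption and not what the script in this post does, is to use the helpers file_path_sans_ext and file_ext from the tools package that ships with R. The file name P1120006.edited.JPG is invented purely for illustration.

```r
# Two names from the test folder plus one invented name with an extra dot
file_names <- c("P1120006.JPG", "P1120006.RW2", "P1120006.edited.JPG")

# Everything before the last dot, and the extension after it
stems <- tools::file_path_sans_ext(file_names)
exts <- tools::file_ext(file_names)

data.frame(file_name_without_ext = stems, ext = exts, stringsAsFactors = FALSE)
```

Both helpers only look at the last dot, so the invented third name keeps "P1120006.edited" as its stem and "JPG" as its extension.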
Now I count how many files exist for each file_name_without_ext by grouping by that variable and counting the number of rows using the little magic n() function from dplyr. This is such a common pattern and I love that dplyr makes this so easy - I remember doing this for my Bachelor thesis without the tidyverse and it was soo difficult for me.
# could be replaced by shorthand: dplyr::add_count(file_name_without_ext)
df <- df %>%
  dplyr::group_by(file_name_without_ext) %>%
  dplyr::mutate(n = n())
df

## # A tibble: 8 x 4
## # Groups:   file_name_without_ext [5]
##   full_path              file_name_without_ext ext       n
##   <chr>                  <chr>                 <chr> <int>
## 1 /tmp/pics/P1120001.RW2 P1120001              RW2       1
## 2 /tmp/pics/P1120006.JPG P1120006              JPG       2
## 3 /tmp/pics/P1120006.RW2 P1120006              RW2       2
## 4 /tmp/pics/P1120008.RW2 P1120008              RW2       1
## 5 /tmp/pics/P1120009.JPG P1120009              JPG       2
## 6 /tmp/pics/P1120009.RW2 P1120009              RW2       2
## 7 /tmp/pics/P1120010.JPG P1120010              JPG       2
## 8 /tmp/pics/P1120010.RW2 P1120010              RW2       2
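The comment in the code above mentions dplyr::add_count as a shorthand. A small sketch of what that could look like, using a made-up three-row tibble (df_demo and counted are illustrative names) instead of the real file listing:

```r
library(dplyr)

# Made-up miniature version of the data frame from this step
df_demo <- tibble::tibble(
  file_name_without_ext = c("P1120001", "P1120006", "P1120006"),
  ext = c("RW2", "JPG", "RW2")
)

# add_count() appends a column n with the number of rows per value
counted <- df_demo %>%
  dplyr::add_count(file_name_without_ext)
counted
```

One difference to the group_by plus mutate version: add_count returns an ungrouped data frame, so there is no leftover grouping to remember in later steps.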
Now I filter those rows where n == 1 - those are the RW2 files that are the leftover companions of the JPGs I deleted manually. Just to be sure, I also add the ext == "RW2" condition to the filter statement.[3]
delete_df <- df %>%
  dplyr::filter(n == 1 & ext == "RW2")
nrow(delete_df) # only 2 files left

## [1] 2
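The same selection can also be read as a set difference: all RW2 names minus all JPG names. This is just another way to look at Step 2, sketched in base R under the assumption that the eight file names from the test folder are typed in by hand. The variable names raw_stems, jpg_stems and orphan_raw are made up for this sketch.

```r
# The eight file names from the test folder, typed in by hand
file_names <- c(
  "P1120001.RW2", "P1120006.JPG", "P1120006.RW2", "P1120008.RW2",
  "P1120009.JPG", "P1120009.RW2", "P1120010.JPG", "P1120010.RW2"
)

stems <- tools::file_path_sans_ext(file_names)
exts <- tools::file_ext(file_names)

# RAW files whose name does not show up among the JPGs
raw_stems <- stems[exts == "RW2"]
jpg_stems <- stems[exts == "JPG"]
orphan_raw <- setdiff(raw_stems, jpg_stems)
orphan_raw
```

Unlike the n == 1 filter, this version never looks at JPGs without a RAW partner, so the extra ext == "RW2" safety condition is built in.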
Step 3: delete, delete, delete!
I use dplyr::pull to get the full_path variable from the data frame.[4] I also add a small check that I indeed have only RW2 files - all this making-sure thing is getting a bit out of hand but better safe than sorry. 😉
And finally: delete, delete, delete that sh*t with file.remove.
delete_paths <- delete_df %>%
  dplyr::pull(full_path)
print(delete_paths)

## [1] "/tmp/pics/P1120001.RW2" "/tmp/pics/P1120008.RW2"

# some quick check
# don't delete JPG
stopifnot(all(stringr::str_ends(delete_paths, "RW2")))
stopifnot(length(delete_paths) == 2)

# delete!
file.remove(delete_paths)

## [1] TRUE TRUE
This deletes the two files that do not have a JPG companion. In the real use, my script successfully deleted 258 files, as can be seen by comparing the before (posted at the beginning of this post) and after screenshots of my explorer.
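For the next holiday, the three steps could be wrapped into one function with a dry-run switch, so that nothing gets deleted before the list has been looked at. This is only a sketch in base R, not the script I actually ran: delete_orphan_raw and its dry_run argument are made-up names, and the demo runs on a temp folder with empty files.

```r
# Hypothetical helper: find RW2 files without a JPG of the same name
delete_orphan_raw <- function(folder, dry_run = TRUE) {
  full_paths <- list.files(folder, full.names = TRUE)
  stems <- tools::file_path_sans_ext(basename(full_paths))
  exts <- tools::file_ext(full_paths)

  is_orphan <- exts == "RW2" & !(stems %in% stems[exts == "JPG"])
  orphans <- full_paths[is_orphan]

  if (dry_run) {
    # Only report what would happen, do not touch any file
    message("Would delete ", length(orphans), " file(s)")
    return(invisible(orphans))
  }
  removed <- file.remove(orphans)
  invisible(orphans[removed])
}

# Demo on a throwaway folder with empty files
demo_dir <- tempfile("pics_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("P1120001.RW2", "P1120006.JPG", "P1120006.RW2")))

planned <- delete_orphan_raw(demo_dir)                  # dry run, nothing deleted
deleted <- delete_orphan_raw(demo_dir, dry_run = FALSE) # actually deletes
list.files(demo_dir)
```

Making the dry run the default is a deliberate choice in this sketch: deleting has to be asked for explicitly.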
Hurray for the power of computers! 🎉
I don’t know whether this brought any considerable insight to anyone. 😄 After all, this is not the usual use case for R - a well-written shell command would’ve achieved the same. Or… actually manually deleting the files… But no, this was never an alternative.
Take away from this? Being able to program makes you lazy - or rather it gives you the ability to be lazy by just automating everything away. 😎 👅 And in my opinion, this is just another excellent reason to: keep coding. ❤️
[1] The double call could be avoided by splitting the full path using something like tidyr::separate but I was lazy.
[2] Sidenote: There’s also tidyr::separate_rows, which is even more awesome!
[3] If I did my manual deletion process how I described it, this should not be necessary as a JPG should always have a “partner” RAW file. But who knows? 🤷
[4] pull is just like $ - it just integrates better into pipe workflows. As I broke up the pipe for “educational” purposes, it does not really make sense here but I thought I’d leave it in just in case someone did not know about it yet.