This module requires a sandbox to complete. A sandbox gives you access to free resources. Your personal subscription will not be charged. The sandbox may only be used to complete training on Microsoft Learn. Use for any other reason is prohibited, and may result in permanent loss of access to the sandbox.

Microsoft provides this lab experience and related content for educational purposes. All presented information is owned by Microsoft and intended solely for learning about the covered products and services in this Microsoft Learn module.

Have you ever needed to convert data from one format to another? You probably have, or you will in the future. This process is called data wrangling, and it’s a common task for developers. Before we learn how to wrangle data, we need some data files to work with.

As a developer, you’ll often need to extract information from logs. In this module, we’ll use NASA logs and the command line. To get started, you’ll need to download the datasets to the sandbox environment.

 Note

The process of signing in to activate the sandbox runs outside the learning module. You’re automatically returned to the module after you sign in.

The sandbox is active for a limited amount of time. If you plan to complete this module in multiple sessions, consider using Cloud Shell in the Azure portal to test steps so that your work is not lost.

  1. Use the following command to make a new directory named data.

     mkdir data

  2. Use the wget command to download the dataset.

     wget -P data/ https://raw.githubusercontent.com/MicrosoftDocs/mslearn-data-wrangling-shell/main/NASA-logs-1995.txt
     wget -P data/ https://raw.githubusercontent.com/MicrosoftDocs/mslearn-data-wrangling-shell/main/NASA-software-API.txt

  3. Change to the new directory by using the command cd.

     cd data

  4. Verify that you have the correct files by using the command ls.

     ls

You should see a NASA-software-API.txt file and a NASA-logs-1995.txt file.

The first file, NASA-software-API.txt, is an open dataset that lists all the software in use by NASA. For more information on the original dataset, see NASA Open Source and General Resource Software API. The second file, NASA-logs-1995.txt, contains all the logged requests to the NASA Kennedy Space Center server.
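If you want a scripted check instead of reading the ls output by eye, the following is a minimal sketch. The helper name check_files is hypothetical, and the demonstration runs against stand-in files in a scratch directory rather than the real datasets, so that it works without network access.

```shell
# Hypothetical helper: report whether each expected file exists and is non-empty.
# Usage: check_files DIRECTORY FILE...
check_files() {
  dir=$1
  shift
  missing=0
  for name in "$@"; do
    if [ -s "$dir/$name" ]; then
      echo "ok: $name"
    else
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demonstration with stand-in files (assumption: these are NOT the real datasets).
demo_dir=$(mktemp -d)
printf 'sample line\n' > "$demo_dir/NASA-logs-1995.txt"
printf 'sample line\n' > "$demo_dir/NASA-software-API.txt"

check_result=$(check_files "$demo_dir" NASA-logs-1995.txt NASA-software-API.txt)
echo "$check_result"
```

In the sandbox, you could call the same helper from the directory that contains data, for example check_files data NASA-logs-1995.txt NASA-software-API.txt.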

Peek into the contents of your files

Recall that in UNIX, by default, a terminal has three streams: an input stream and two output streams. The input stream is referred to as stdin, for standard input, and is mapped to the keyboard. The standard output stream, or stdout, generally prints to the terminal or might be consumed by another program or process. The other output stream, stderr, is primarily used for error messages and diagnostics, and usually prints to the terminal like stdout.

You might be wondering why you needed that refresher. In the following units, we’ll be talking about programs and filters and their standard input and output. You’ll need a basic understanding of how these items are related. All of this information will make more sense as you move forward in the module.
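To make the three streams concrete, here is a small sketch that sends stdout and stderr to separate files and then pipes stdout into another program. The file names present.txt, absent.txt, out.log, and err.log are made up for the illustration and are not part of the module.

```shell
work=$(mktemp -d)
printf 'hello\n' > "$work/present.txt"

# cat writes file contents to stdout and error messages to stderr.
# ">" redirects stdout to a file; "2>" redirects stderr to a different file.
# absent.txt does not exist, so cat reports an error for it and exits non-zero.
cat "$work/present.txt" "$work/absent.txt" > "$work/out.log" 2> "$work/err.log" || true

echo "stdout captured:"
cat "$work/out.log"
echo "stderr captured:"
cat "$work/err.log"

# A pipe connects only stdout to the stdin of the next program.
# Here stderr is discarded, and tr reads the remaining text from its stdin.
upper=$(cat "$work/present.txt" "$work/absent.txt" 2>/dev/null | tr 'a-z' 'A-Z')
echo "$upper"
```

The exact wording of the error message in err.log depends on the system, which is why the sketch only displays it and doesn't compare it against a fixed string.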

Before you jump into wrangling your data, it’s useful to do some basic file inspection. You want to get an idea of what the raw data looks like.

Head and tail commands

The head and tail commands are used to examine the top (head) or bottom (tail) parts of a file. By default, both commands display 10 rows of content. If you want to display more or fewer rows, you can use the option flag -n to specify the number of rows to be printed to stdout.

We’ll use the tail and head commands to display the last and first five rows of the NASA-software-API.txt file, respectively.

  1. Type the command tail with the -n flag to display the last 5 rows in the file.

     tail -n 5 NASA-software-API.txt

     Your output should look like this:

     SSC-00393 SSC 2013-05-17T00:00:00.000 "General Public" "Software Suite to Support In-Flight Characterization of Remote Sensing Systems"
     SSC-00424 SSC 2013-09-06T00:00:00.000 "General Public" "SSC Site Status Mobile Application"
     GSC-14732-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Tool For Interactive Plotting, Sonification, And 3D Orbit Display (TIPSOD)"
     GSC-14730-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Space Physics Data Facility Web Services"
     GSC-14726-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Earth Observing System (EOS) Clearinghouse (ECHO)"

  2. Type the command head with the -n flag to display the first 5 rows in the file.

     head -n 5 NASA-software-API.txt

     Your output should look like this:

     ARC-14136-1 ARC 2001-10-19T00:00:00.000 "Academic Worldwide" "Adaptive Relevance-Learning Software Component (ARNIE)"
     ARC-14293-1 ARC 2005-09-19T00:00:00.000 "Open Source" "Genetic Graphs (JavaGenes)"
     ARC-14297-1 ARC 2003-11-06T00:00:00.000 "General US" "Automated Domain Decomposition Software, PEGASUS Version 5.0"
     ARC-14379-1 ARC 2002-03-27T00:00:00.000 "General US" "Man-machine Integration Design And Analysis System (MIDAS)"
     ARC-14400-1 ARC 2001-01-29T00:00:00.000 "General US" "PLOT3D Version 4.0"
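Because head and tail both read from stdin and write to stdout, they can be chained with a pipe to pull out a range of rows from the middle of a file. The sketch below uses a generated 12-row sample file instead of the NASA dataset so that the results are easy to check by eye.

```shell
sample=$(mktemp)
# printf repeats its format for each argument, producing "line 1" through "line 12".
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

# With no -n flag, head prints the first 10 rows.
default_count=$(head "$sample" | wc -l | tr -d ' ')

first3=$(head -n 3 "$sample")   # rows 1 to 3
last2=$(tail -n 2 "$sample")    # rows 11 and 12

# Rows 4 to 6: keep the first 6 rows, then keep the last 3 of those.
middle=$(head -n 6 "$sample" | tail -n 3)

echo "default row count: $default_count"
echo "$middle"
```

The same pattern applies to NASA-software-API.txt: head -n 6 NASA-software-API.txt | tail -n 3 would print rows 4 through 6 of that file.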

nl filter

The nl filter reads lines from files or from stdin and prints its output to stdout. By default, nl numbers the lines in a file and uses a tab to separate the line number from the text.

Let’s use nl with the flag -s to use = as a delimiter.

nl -s = NASA-software-API.txt

Your output should list each line in the file, ending with this:

697=SSC-00424 SSC 2013-09-06T00:00:00.000 "General Public" "SSC Site Status Mobile Application"
698=GSC-14732-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Tool For Interactive Plotting, Sonification, And 3D Orbit Display (TIPSOD)"
699=GSC-14730-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Space Physics Data Facility Web Services"
700=GSC-14726-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Earth Observing System (EOS) Clearinghouse (ECHO)"

The nl filter has flags that allow you to change the increment value (-i), change the numbering format (-n with ln, rn, or rz), or change the starting number (-v).
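As a rough illustration of how these flags combine, the sketch below numbers a three-line sample file starting at 10, in steps of 5, with zero-padded numbers. The -w flag, which sets the width of the number field, isn't covered in this unit and is included here only so that the padding is predictable.

```shell
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$sample"

# -v 10   start numbering at 10
# -i 5    increase the number by 5 for each line
# -n rz   right-justify the number and pad it with leading zeros
# -w 3    use a 3-character number field (not covered in this unit)
# -s ': ' separate the number from the text with a colon and a space
numbered=$(nl -v 10 -i 5 -n rz -w 3 -s ': ' "$sample")
echo "$numbered"
# Prints:
# 010: alpha
# 015: beta
# 020: gamma
```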

wc command

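The word count command wc counts the number of lines, words (separated by white space), and characters in a file or from stdin. By default, the third count is reported in bytes, which equals the number of characters only in single-byte encodings such as ASCII. The output is printed to stdout, with the counts separated by white space.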
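Each count can also be requested on its own. The sketch below uses the standard flags -l (lines), -w (words), and -c (bytes), which aren't introduced in this unit, on a small sample file. Reading the file through stdin keeps the file name out of the output, and tr strips the padding that some versions of wc add.

```shell
sample=$(mktemp)
# Four lines (one of them empty), six words, 29 bytes.
printf 'one two\nthree\n\nfour five six\n' > "$sample"

lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"
# Prints: lines=4 words=6 bytes=29
```

Note that wc -l counts the empty line as a line, which matters when you compare its result with line numbers produced by other tools.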

Use the command wc to see the number of lines, words, and characters in NASA-software-API.txt.

wc NASA-software-API.txt

Your output should look like this:

703    8917   81115 NASA-software-API.txt

You can see from the output that the file has 703 lines, 8,917 words, and 81,115 characters. Let’s check the output from the previous command, nl. The last printed line is:

700=GSC-14726-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Earth Observing System (EOS) Clearinghouse (ECHO)"

Did you notice that the index of this line is 700 instead of 703? What’s happening here?

This index mismatch happens because, by default, the command nl doesn’t number empty lines. Let’s run the command nl with the option flag -b a to count all the lines, including the empty ones.

nl -b a NASA-software-API.txt

The last line in the output should be:

703  GSC-14726-1 GSFC 2004-06-09T00:00:00.000 "Open Source" "Earth Observing System (EOS) Clearinghouse (ECHO)"

The index now matches the lines counted with the command wc.
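One way to confirm this explanation is to count the empty lines directly and check that they account for the gap. The sketch below does so on a small sample file. Using grep -c with the pattern ^$ to count empty lines, and awk to pick out the line number, are additions that this unit doesn't cover.

```shell
sample=$(mktemp)
# Six lines in total, three of them empty.
printf 'first\n\nsecond\n\n\nthird\n' > "$sample"

total=$(wc -l < "$sample" | tr -d ' ')   # every line, empty or not
blank=$(grep -c '^$' "$sample")          # empty lines only

# The first field of nl's last output line is the highest number it assigned.
last_default=$(nl "$sample" | tail -n 1 | awk '{print $1}')
last_all=$(nl -b a "$sample" | tail -n 1 | awk '{print $1}')

echo "wc -l: $total, empty: $blank, nl: $last_default, nl -b a: $last_all"
# Prints: wc -l: 6, empty: 3, nl: 3, nl -b a: 6
```

If the same reasoning holds for NASA-software-API.txt, grep -c '^$' NASA-software-API.txt should report 3, the difference between 703 and 700.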