INPUT and INFILE
SAS System procedures can operate only on SAS data sets.Quite often, however,the data that you need to process are in a raw form.The first step is,therefore,to transform the raw data into a SAS data set.
The work of manufacturing this is done in a SAS DATA step through the use of a DATA statement.This statement names the SAS data set you are creating.The raw data are then read into the data set via an INPUT statement.
The seemingly simple INPUT statement is really a SAS System power house in that it can create a SAS data set from raw data existing in a wide variety of formats.The raw data may exist in a file external to the environment in which the SAS code is being prepared(in which case they are usually referred to by an INFILE statement),or they can be entered instream along with the SAS code by means of a DATALINES statement.SAS software will also recognize the older CARDS statement as the beginning of raw data input.In this first chapter,you will see the power,ease of use,and flexibility of this key DATA statement. Reading Raw Data Separated by Spaces
There are a variety of different styles of INPUT code that can be used to read raw data. List input reads data into a SAS data set using a "space delimited" form of data entry. This method can be used when each raw data value is separated from the next one by one or more spaces.This form of data entry has its limitations to be sure, but let us first lay it out via an example before we pick it apart.Suppose you have the following raw data values and you want to create a SAS data set from them:
Suppose you have the following raw data values and you want to create a SAS data set
from them:
ID HEIGHT WEIGHT GENDER AGE 1 68 144 M 23 2 78 202 M 34 3 62 99 F 37 4 61 101 F 45
This can be done using list input as follows:
Example
ID HEIGHT WEIGHT GENDER AGE 1 68 144 M 23 2 78 202 M 34 3 62 99 F 37 4 61 101 F 45
The previous code produces the following output:
Output from Example - Reading Raw Data Separated by Spaces
DATA LISTINP; INPUT ID HEIGHT WEIGHT GENDER $ AGE; DATALINES; 1 68 144 M 23 2 78 202 M 34 3 62 99 F 37 4 61 101 F 45 ; PROC PRINT DATA=LISTINP; TITLE 'Example 1'; RUN;
There are several important points to notice about this basic example.Raw data lines are read beginning with the line immediately following the DATALINES statement.You signify the end of your data input with a lone semicolon (called a NULL statement) on the line following your last line of data.
Some programmers prefer a RUN statement followed by a semicolon instead.Take your choice. Next,the INPUT statement lists the variables you wish to create in the same order as the corresponding data values listed below the DATALINES statement.You cannot skip over any data values with this simple form of list input.Later in this chapter we'll demonstrate how to jump around the raw data line when reading in the data.
The SAS System reads data as either character or numeric,and then stores them as such.Numeric data can contain numbers or numeric missing values(see below), while character data can contain numbers, letters, character missing values,and any special characters(e.g. _, #, &).
In this example,GENDER is a character variable because it contains the alphanumeric characters,M or F.We indicate that GENDER is a character variable by following it with a dollar sign ($) in the INPUT statement.
Without the dollar sign,the program would be expecting numerical values for GENDER (and would get really upset when it encountered M's and F's). Another point to notice is that the data values have to be separated by at least one blank space as they all are in the previous example.Data values can be separated by more than one space (possibly to improve readability).
The lines of data in the following code could be substituted for the original four lines of data with no change in the resulting data set:
DATALINES; 1 68 144 M 23 2 78 202 M 34 3 62 99 F 37 4 61 101 F 45
Messy isn't it? It will, however,work just fine.
The PRINT procedure statements are included in this program so that you can see that the DATA step reads the data as expected.For now,all you need to know about PROC PRINT is that it is a procedure that will print the contents of a SAS data set.Chapter 9, "PROC PRINT," contains more information on PROC PRINT .
Handling Missing Values
Now suppose you have a missing value of HEIGHT for observation 2,and enter your data as follows:
DATALINES; 1 68 144 M 23 2 202 M 34 3 62 99 F 37 4 61 101 F 45
The data item is missing so why not just leave it out? Although itlooks right,this will get you into big trouble with list input.If you leave the value blank,the program will just look for the next value it finds,after at least one space and read it as the data for the second variable.
In this case,after reading in a value of 2 for ID,the SAS System looks for a value for HEIGHT. It finds 202 and accepts it as a value for HEIGHT. It then attempts to read M as the value for WEIGHT.WEIGHT is a numeric variable and cannot have a non-number as a value.
The result will be an error message in the log and a missing value for WEIGHT.Next,3 4 is read as the value for GENDER.This is legal because GENDER is a character variable and can have any alphanumeric content.
The program is still looking for a value for AGE,so it goes to the next line to read the first data value on that line,3,as AGE.Since the value for AGE is the last value the INPUT statement looks for,the program completes building the current observation, brings in the next input line, and starts building the next observation.
It's really amazing how wrong things can get when you make one simple innocent mistake.So how do you solve this problem when reading in list input data? Use a period (.),separated by one or more blanks from surrounding data to indicate that a data value is missing.
The period acts as a place holder.It tells the INPUT statement that there is no value to be read here,and to get on about its business.In our example, the correct way to indicate a missing value for HEIGHT for observation 2 is as follows:
DATALINES; 1 68 144 M 23 2 . 202 M 34 3 62 99 F 37 4 61 101 F 45
|