Reading Big Data into MATLAB

by admin in Matlab on May 11, 2019

In this article we will discuss datastore, one of the new big data capabilities introduced since MATLAB R2014b.

Contents

  • About the Data
  • What is a Datastore?
  • Defining Our Input Data
  • Creating the DataStore
  • Preview the Data
  • Select Data to Import
  • Adjust Variable Format
  • Read in First Chunk
  • Example 1: Read Selected Columns of Data for Use in Memory
  • Example 2: Filter Data Down to a Subset for Use in Memory
  • Example 3: Perform Analysis on Chunks of Data and Combine the Results
  • Extending the Use of Datastore
  • Conclusion

About the Data

datastore is used for reading data that is too large to fit in memory. For this example, we will be reading in data from the vehicle census of Massachusetts. It is a catalog of information about vehicle registered from 2008 to 2011. The dataset contains information about individual cars registered including vehicle type, location where the vehicle is housed, rated MPG, and measured CO_2 emissions. You can learn more about the data and even download it yourself, here. I have renamed the files in the demo for clarity’s sake, but this is where they came from originally.

What is a Datastore?

As mentioned, a datastore is an object useful for reading collections of data that are too large to fit in memory.

Defining Our Input Data

datastore can work with a single file or a collection of files. In this case, we will be reading from a single file. Our file does not include variable names at the top of the file. They are listed in separate header file as defined below.

% Define Data File and Header File 
datafile = 'vehiclebig.csv';
headerfile = 'varnames.txt';
% Read in Variable Names
fileID = fopen(headerfile);
varnames = textscan(fileID,'%s');
varnames = varnames{:};
fclose(fileID);

Creating the DataStore

We can now create our datastore by giving the name of the data file as the import to the datastore function. We also specify our datastore not use the first row of our file as variable names. We will set those variable names explicitly using the names found in the ‘varnames.txt’ file.

ds = datastore(datafile,'ReadVariableNames',false); 
% Set Variable Names
ds.VariableNames = varnames
ds =    TabularTextDatastore with properties:                       
Files: {
'H:\Documents\LOREN\MyJob\Art of MATLAB\SarahZ\datastore\vehiclebig.csv' }
ReadVariableNames: false
VariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}

Text Format Properties:
NumHeaderLines: 0
Delimiter: ','
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN

Advanced Text Format Properties:
TextscanFormats: {'%f', '%f', '%f' ... and 42 more} ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false

Properties that control the table returned by preview, read, readall: SelectedVariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more} SelectedFormats: {'%f', '%f', '%f' ... and 42 more} RowsPerRead: 20000

Note that we haven’t read in our data yet. We have just provided an easy way to access it through ds, our datastore.

Preview the Data

A really nice thing about a datastore is that you can preview your data without having to load it all into memory. datastore reads the data into a table which is a data type in MATLAB designed to work well with tabular data.

data = preview(ds); 
whos data data(:,1:7)
% Look at first 7 variables
Name   Size            Bytes  Class    Attributes   
data 8x45 21426 table

ans =
record_id vin_id plate_id me_id owner_type start_odom start_date ________ ______ ________ _____ _________ __________ __________ 2 1 5.3466e+06 1.2801e+07 1 NaN '2011-11-07' 4 1 5.3466e+06 1.1499e+07 1 NaN '2009-08-27' 7 2 6.6148e+06 1.2801e+07 1 NaN '2011-11-19' 9 2 6.6148e+06 1.1499e+07 1 NaN '2008-07-01' 10 3 6.4173e+06 1.2801e+07 1 NaN '2011-12-06' 1 1 5.3466e+06 1 1 30490 '2009-09-01' 3 1 5.3466e+06 2 1 55155 '2010-10-02' 5 2 6.6148e+06 3 1 5 '2008-07-02'

By default, datastore will read in every column of our dataset. datastore makes an educated guess for the appropriate format for each column (variable) of our data. We can, however, specify a subset of columns or different formats if we wish.

Select Data to Import

We can specify which variables (columns) by setting the Selected Variable Names property of our datastore. In this case, we only want to bring in 5 columns out of the 45.

ds.SelectedVariableNames = {'model_year', 'veh_type', ...   'curbwt','mpg_adj','hybrid'}; 
preview(ds)
ans =      
model_year veh_type curbwt mpg_adj hybrid
__________ ________ ______ _______ ______
2008 'Car' 3500 21.65 0
2008 'Car' 3500 22.54 0
2008 'SUV' 4500 16 0
2008 'SUV' 4500 17 0
2005 'Truck' 5000 13.29 0
2008 'Car' 3500 22.09 0
2008 'Car' 3500 21.65 0
2008 'SUV' 4500 16.66 0

Adjust Variable Format

We can adjust the format of the data we wish to access by using the SelectedFormats property. We can specify to bring in the vehicle type as a categorical variable by using the %C specifier.

ds.SelectedFormats; ds.SelectedFormats{2} = '%C'  
% read in as a categorical
ds =    
TabularTextDatastore with properties:
Files: {
'H:\Documents\LOREN\MyJob\Art of MATLAB\SarahZ\datastore\vehiclebig.csv'
}
ReadVariableNames: false
VariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more} Text

Format Properties:
NumHeaderLines: 0
Delimiter: ','
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN

Advanced Text Format Properties:
TextscanFormats: {'%*q', '%*q', '%*q' ... and 42 more} ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false

Properties that control the table returned by preview, read, readall: SelectedVariableNames: {'model_year', 'veh_type', 'curbwt' ... and 2 more}
SelectedFormats: {'%f', '%C', '%f' ... and 2 more} RowsPerRead: 20000

Read in First Chunk

We can use the read function to read in a chunk of our data. By default, read reads in 20000 rows at a time. This value can be adjusted using the RowsPerRead property.

testdata = read(ds);
whos testdata
Name             Size             Bytes  Class    Attributes   
testdata 20000x5 683152 table

After you read in a chunk, you can use the hasdata function to see if there is still additional data available to read from the datastore.

hasdata(ds)
ans =
     1

By using hasdata and read in a while loop with your datastore, you can read in your entire dataset a piece at a time. We will put in a counter just to track how many read operations took place in our loop.

counter = 0; while hasdata(ds)     
% Read in Chunk
dataChunk = read(ds);
counter = counter + 1; end counter
counter =
   825

By using reset, we can reset our datastore and start reading at the beginning of the file.

reset(ds)

Now that we see how to get started using datastore, let’s look at 3 different ways to use it to work with a dataset that does not entirely fit in the memory of your machine.

Example 1: Read Selected Columns of Data for Use in Memory

If you are interested in only processing certain columns of your text file and those columns can fit in memory, you can use datastore to bring in those particular columns from your text file. Then, you can work with that data directly in memory. In this example, we are only interested in the model year and vehicle type of the cars that were registered. We can use readall instead of read to import all the selected data instead of just a chunk of it at a time.

ds.SelectedVariableNames = {'model_year', 'veh_type'}; 
cardata = readall(ds);
whos cardata
Name               Size                Bytes   Class    Attributes
cardata 16145383x2 161456272 table

Now that you have the data read into MATLAB, you can work with it like you would normally work with your data in MATLAB. For this example, we will just use the new histogram function introduced in R2014b to look at the distribution of vehicle model years registered.

figure histogram(cardata.model_year) 
hold on
histogram(cardata.model_year(cardata.veh_type == 'Car'))
hold off
xlabel('Model Year')
legend({'all vehicles', 'only cars'},'Location','southwest')

Example 2: Filter Data Down to a Subset for Use in Memory

Another way to subset your data is to filter the data down a chunk at time. Using datastore you can read in a chunk of data and keep only data you need from that chunk. You then continue this process, chunk by chunk, until you reach the end of the file and have only the subset of the data you want to use.

In this case, we want to extract a subset of data for cars that were registered in 2011. The new variables we are loading in (e.g., q1_2011), contain either a one or zero. Ones represent valid car registrations during that time. So we only save the rows which contain a valid registration sometime in 2011 and discard the rest.

reset(ds) 
ds.SelectedVariableNames = {'model_year','veh_type',... 'q1_2011','q2_2011','q3_2011','q4_2011'};

data2011 = table;
while hasdata(ds)
% Read in Chunk
dataChunk = read(ds);
% Find if Valid During Any Quarter in 2011
reg2011 = sum(dataChunk{:,3:end},2);
% Extract Data to Keep (Cars Registered in 2011)
idx = reg2011 >= 1 & dataChunk.veh_type == 'Car';
% Save to Final Table
data2011 = [data2011; dataChunk(idx,1:2)];
end
whos data2011
figure
histogram(data2011.model_year)
xlabel('Model Year')
legend({'cars registered in 2011'})
       Name           Size                Bytes  Class    Attributes
data2011 3503434x2 35036782 table

Example 3: Perform Analysis on Chunks of Data and Combine the Results

But what if we can’t hold the subset of the data we are interested in analyzing in memory? We could instead process the data a section at a time and then combine intermediate results to get a final result. In this case, let’s look at the % of hybrid cars registered every quarter (in terms of total cars registered). So, we compute a running total of the number of cars registered per quarter as well as the number of hybrids registered per quarter. Then, we calculate the final % when we have read through the entire dataset.

% Reset Datastore 
reset(ds)
% Select Data to Import
quarterNames = varnames(end-15:end);
ds.SelectedVariableNames = [{'veh_type', 'hybrid'} quarterNames'];
% Read in Vehicle Type as a Categorical Variable
ds.SelectedFormats{1} = '%C';
totalCars = zeros(length(quarterNames),1);
totalHybrids = zeros(length(quarterNames),1);

while hasdata(ds)
% Read in Chunk
dataChunk = read(ds);
for ii = 1:length(quarterNames)
% Loop over car model years
% Extract Data
idx = dataChunk{:,quarterNames(ii)}== 1 & dataChunk.veh_type == 'Car'; idxHy = idx & dataChunk.hybrid == 1;
% Perform Calculation
totalCarsChunk = sum(idx);
totalHybridsChunk = sum(idxHy);
% Save Result
totalCars(ii) = totalCarsChunk + totalCars(ii);
totalHybrids(ii) = totalHybridsChunk + totalHybrids(ii);
end
end
percentHybrid = (totalHybrids./totalCars)*100; 
figure
scatter(1:length(percentHybrid),percentHybrid,'filled')
xlabel('Inspection Year')
ylabel('% Hybrids')
% Label tick axes
ax = gca;
ax.TickLabelInterpreter = 'none';
ax.XTick = 1:length(quarterNames);
ax.XTickLabel = quarterNames;
ax.XTickLabelRotation = -45;

Extending the Use of Datastore

You can also use datastore as the first step to creating and running your own MapReduce algorithms in MATLAB. Learn more about running MapReduce algorithms with MATLAB here. Perhaps this topic will be a blog post in the future.

Conclusion

Do you think you can use datastore with your big data? share us your opinion down.

Categories

Share Your Valuable Opinions