MapReduce Composite Key Operation – Part 2

Composite Key Operation In complex operations that involve multiple columns, a simple key may not be enough. For example, if we need to calculate the population grouped by country and state, the choice of key matters: a well-chosen composite key makes the problem easy to solve. We can create a composite key by implementing the WritableComparable interface, which lets the key be used like any other Writable key. Below is a composite key with two fields, country and state, which overrides the compareTo() method to sort first by country and then by state. It also provides write() and readFields() methods to serialize and deserialize the attributes.

private static class CompositeGroupKey implements
        WritableComparable<CompositeGroupKey> {

    String country;
    String state;

    // No-argument constructor is required so Hadoop can instantiate
    // the key when deserializing it.
    public CompositeGroupKey() {
    }

    public CompositeGroupKey(String country, String state) {
        this.country = country;
        this.state = state;
    }

    // Serialize the two attributes.
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeString(out, country);
        WritableUtils.writeString(out, state);
    }

    // Deserialize the two attributes in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        this.country = WritableUtils.readString(in);
        this.state = WritableUtils.readString(in);
    }

    // Sort by country first, then by state.
    public int compareTo(CompositeGroupKey pop) {
        if (pop == null)
            return 0;
        int intcnt = country.compareTo(pop.country);
        return intcnt == 0 ? state.compareTo(pop.state) : intcnt;
    }

    @Override
    public String toString() {
        return country + ":" + state;
    }
}
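
One caveat worth noting: the default HashPartitioner assigns keys to reducers using the key's hashCode(). The class above inherits Object.hashCode(), so if the job were run with more than one reducer, equal keys could end up on different reducers. Below is a minimal sketch of the methods that could be added to the class for that case; the multiplier 163 is just a conventional choice and not part of the original code.

@Override
public int hashCode() {
    // Combine both fields so equal country/state pairs hash identically.
    return country.hashCode() * 163 + state.hashCode();
}

@Override
public boolean equals(Object o) {
    if (!(o instanceof CompositeGroupKey)) {
        return false;
    }
    CompositeGroupKey other = (CompositeGroupKey) o;
    return country.equals(other.country) && state.equals(other.state);
}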

We use the above composite key in a MapReduce job that counts the total population grouped by country and state.

Input:

Country, State, City, Population (Mil)
USA, CA, Su, 12
USA, CA, SA, 42
USA, CA, Fr, 23
USA, MO, XY, 23
USA, MO, AB, 19
USA, MO, XY, 23
USA, MO, AB, 19
IND, TN, AT, 11
IND, TN, KL, 10

Output:

Country, State, Total Population (Mil)
IND, TN, 21
USA, CA, 77
USA, MO, 84

Mapper Program Once the composite key is defined, we create the mapper class, which consumes the input produced by the InputFormat. The InputFormat splits the file and passes each split to an individual mapper, which runs a map task. The map task transforms each record of its input split into a key-value pair, where the key and value must implement the Writable interface (keys additionally implement WritableComparable). Writable provides the capability to serialize the data to disk, and WritableComparable adds sorting. The number of map tasks is decided by the InputSplits defined by the InputFormat; a split is logical, not physical. The framework first invokes the setup(Context) method, then invokes map(KEYIN, VALUEIN, Context) for each record in the input split, and finally invokes cleanup(Context) for cleanup activity. We extend the generic Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> class, whose type parameters indicate the input and output key and value types, and override the map() method to process the input data as below.

public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] keyvalue = line.split(",");
populat.set(Integer.parseInt(keyvalue[3]));
CompositeGroupKey cntry = new CompositeGroupKey(keyvalue[0], keyvalue[1]);
context.write(cntry, populat);
}
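
For context, a minimal full mapper class might look like the following sketch; it mirrors the GroupMapper from Part 1 and the GroupMapper name used in the job setup below. The trim() calls are added here only because the sample rows contain spaces after the commas.

public class GroupMapper extends Mapper<LongWritable, Text, CompositeGroupKey, IntWritable> {

    // Reused across map() calls to avoid creating a new object per record.
    private final IntWritable populat = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each value is one CSV line: Country, State, City, Population
        String line = value.toString();
        String[] keyvalue = line.split(",");
        populat.set(Integer.parseInt(keyvalue[3].trim()));
        CompositeGroupKey cntry = new CompositeGroupKey(keyvalue[0].trim(), keyvalue[1].trim());
        context.write(cntry, populat);
    }
}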

If you look at the above map method, it receives the key as a LongWritable (the byte offset of the line) and the value as Text; the value contains one line of the file. Inside the map method we split the line, create the key-value pair, and emit it to the intermediate area through the context. The context fills an in-memory buffer with the mapper output and later spills it to local disk. The map task output is grouped by the sorted key and written to local disk as intermediate data. The grouping of map output is defined by the partitioner, which identifies the reducer for each key. MapReduce can also run a local combiner, which combines intermediate map output before it is sent to the reducer, cutting down the amount of data transferred from the map to the reduce side. Reducer: The reducer copies the intermediate map task output from local disk over HTTP and routes it to the individual reduce tasks based on key. Before invoking each reduce task, the framework shuffles, merges, and sorts the key-value pairs. Each reduce task processes the collection of values for a key and writes the result to disk. Below is the reduce method, which is invoked once per key, so each call receives a single key together with its collection of values.

@Override
public void reduce(CompositeGroupKey key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Sum the population values for this country:state key.
    int cnt = 0;
    for (IntWritable value : values) {
        cnt = cnt + value.get();
    }
    context.write(key, new IntWritable(cnt));
}

Job: To run the MapReduce job we have to set the Mapper, Reducer, and other properties in the job configuration. The Job class, together with Configuration, is the main configuration entry point; it configures MapReduce parameters such as the Mapper, Reducer, Combiner, InputFormat, OutputFormat, Comparator, and so on. The code below shows how to create and run the job based on the map and reduce code above.

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "GroupMR");
job.setJarByClass(GroupMR.class);
job.setMapperClass(GroupMapper.class);
job.setReducerClass(GroupReducer.class);
job.setOutputKeyClass(CompositeGroupKey.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setMaxInputSplitSize(job, 10);
FileInputFormat.setMinInputSplitSize(job, 100);
FileInputFormat.addInputPath(job, new Path("/Local/data/Country.csv"));
FileOutputFormat.setOutputPath(job, new Path("/Local/data/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
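
Since the reduce function here is a plain sum, which is both commutative and associative, the same reducer class could optionally also be registered as a combiner to shrink the data shuffled from the mappers. This line is an optional addition and not part of the original job setup:

job.setCombinerClass(GroupReducer.class);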

This main method submits the MapReduce job. Before submitting the job we need to set the MapperClass, ReducerClass, OutputKeyClass, and OutputValueClass, and configure the FileInputFormat and FileOutputFormat. Running: Running the MapReduce program is quite straightforward: package the Java application into a JAR, say hadoopessence.jar, and run it from the command line as below, where /input and /output are folders in HDFS:

hadoop jar hadoopessence.jar org.hadoopessence.compositesorting /input /output

Reference: Hadoop Essence: The Beginner’s Guide to Hadoop & Hive

Hadoop MapReduce Group By Operation – Part 1

Introduction This article is a continuation of my previous articles on Hadoop MapReduce. In my previous article, based on my book Hadoop Essence, I discussed the overall MapReduce architecture, how it works, and a basic Hello World program. In this article I discuss how to write a basic MapReduce program that calculates a sum with a group-by operation.


Input and Output This program reads the file country.csv, which contains the columns Country CD, State, City, and Population (million):

Country CD, State, City, Population (million)
USA, CA, Sunnyvale, 12
USA, CA, SAN JOSE, 42
USA, MO, XY, 23
USA, MO, AB, 19
IND, TN, AT, 11
IND, TN, KL, 10

The MapReduce program processes this input and aggregates the total population grouped by country and state:

Country CD, State, Total Population (million)
IND, TN, 21
USA, CA, 54
USA, MO, 42

Environment Setup You can refer to the Setup link to configure Apache Hadoop.

Project Dependency We have to add the Hadoop dependency to the POM.xml of the Maven project. Create a Maven-based Java project and add the Hadoop core dependency below to the POM. This application uses the Hadoop 1.x release.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

Mapper Program The input file is divided into splits, and a map task is spawned for each split. The map task transforms each record of its input split into a key-value pair, where the key and value must implement the Writable interface (keys additionally implement WritableComparable). Writable provides the capability to serialize the data to disk, and WritableComparable adds sorting. The number of map tasks is decided by the InputSplits defined by the InputFormat; a split is logical, not physical. The framework first invokes the setup(Context) method, then invokes map(KEYIN, VALUEIN, Context) for each record in the input split, and finally invokes cleanup(Context) for cleanup activity. We extend the generic Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> class, whose type parameters indicate the input and output key and value types, and override the map() method to process the input data as below.

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] keyvalue = line.split(",");
    cntText.set(new Text(keyvalue[0]));
    stateText.set(keyvalue[1]);
    populat.set(Integer.parseInt(keyvalue[3]));
    Country cntry = new Country(cntText, stateText);
    context.write(cntry, populat);
}

If you look at the above map method, it receives the key as an Object and the value as Text; the value contains one line of the file. Inside the map method we split the line, create the key-value pair, and emit it to the intermediate area through the context. The context fills an in-memory buffer with the mapper output and later spills it to local disk. Below is the custom Country key, which I use to sort the values by country and then by state; it implements the WritableComparable interface.

private static class Country implements WritableComparable<Country> {

    Text country;
    Text state;

    public Country(Text country, Text state) {
        this.country = country;
        this.state = state;
    }

    public Country() {
        this.country = new Text();
        this.state = new Text();
    }

    public void write(DataOutput out) throws IOException {
        this.country.write(out);
        this.state.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        this.country.readFields(in);
        this.state.readFields(in);
    }

    public int compareTo(Country pop) {
        if (pop == null)
            return 0;
        int intcnt = country.compareTo(pop.country);
        if (intcnt != 0) {
            return intcnt;
        } else {
            return state.compareTo(pop.state);
        }
    }

    @Override
    public String toString() {
        return country.toString() + ":" + state.toString();
    }
}

In the above key I have implemented the compareTo method, which the MapReduce framework uses to sort the keys. Below is the full mapper code.

public class GroupMapper extends Mapper<LongWritable, Text, Country, IntWritable> {

    Country cntry = new Country();
    Text cntText = new Text();
    Text stateText = new Text();
    IntWritable populat = new IntWritable();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] keyvalue = line.split(",");
        cntText.set(new Text(keyvalue[0]));
        stateText.set(keyvalue[1]);
        populat.set(Integer.parseInt(keyvalue[3]));
        Country cntry = new Country(cntText, stateText);
        context.write(cntry, populat);
    }
}

The map task output, grouped by the sorted key, is written to the local disk as intermediate data. The grouping of map output is defined by the partitioner, which identifies the reducer for each key. MapReduce also provides a local combiner, which combines intermediate map output before it is passed to the reducer; it helps cut down the amount of data transferred from the map to the reduce side. Here org.apache.hadoop.io.Text is used inside the key; it provides methods to serialize and compare text at the byte level. The IntWritable value is likewise serialized because it implements the Writable interface, and it supports sorting because it implements WritableComparable. Reducer: The reducer copies the intermediate map task output from local disk over HTTP and routes it to the individual reduce tasks based on key. Before invoking each reduce task, the framework shuffles, merges, and sorts the key-value pairs. Each reduce task processes the collection of values for a single key and writes the result to disk. The reducer class extends the generic Reducer class as below:

public static class GroupReducer extends Reducer<Country, IntWritable, Country, IntWritable> {

The generic Reducer class defines the input and output key-value types; here we receive input as <Country, IntWritable> and emit output as <Country, IntWritable>. Below is the reduce method, which is invoked once per key, so each call receives a single key together with its collection of values.

@Override
public void reduce(Country key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Sum the population values for this country:state key.
    int cnt = 0;
    for (IntWritable value : values) {
        cnt = cnt + value.get();
    }
    context.write(key, new IntWritable(cnt));
}

Job: To run the MapReduce job we have to set the Mapper, Reducer, and other properties in the job configuration. The Job class, together with Configuration, is the main configuration entry point; it configures MapReduce parameters such as the Mapper, Reducer, Combiner, InputFormat, OutputFormat, Comparator, and so on. The code below shows how to create and run the job based on the map and reduce code above.

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    // Delete any previous output directory so the job can be rerun.
    FileUtils.deleteDirectory(new File("/Local/data/output"));
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "GroupMR");
    job.setJarByClass(GroupMR.class);
    job.setMapperClass(GroupMapper.class);
    job.setReducerClass(GroupReducer.class);
    job.setOutputKeyClass(Country.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setMaxInputSplitSize(job, 10);
    FileInputFormat.setMinInputSplitSize(job, 100);
    FileInputFormat.addInputPath(job, new Path("/Local/data/Country.csv"));
    FileOutputFormat.setOutputPath(job, new Path("/Local/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

This main method submits the MapReduce job. Before submitting the job we need to set the MapperClass, ReducerClass, OutputKeyClass, and OutputValueClass, and configure the FileInputFormat and FileOutputFormat. FileInputFormat is also used to decide the number of map tasks: the minimum and maximum input split sizes set above take part in that decision, since the split size is derived from the configured minimum, the configured maximum, and the HDFS block size. With this configuration in place, the framework executes the MapReduce job as described in the Job setup.
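
For reference, the split-size rule described above corresponds to the computeSplitSize() helper in Hadoop's FileInputFormat (new API); the sketch below is approximate and may vary between versions:

// Approximation of org.apache.hadoop.mapreduce.lib.input.FileInputFormat
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}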

Running Running the MapReduce program is quite straightforward: package the Java application into a JAR, say hadoopessence.jar, and run it from the command line as below, where /input and /output are folders in HDFS:

hadoop jar hadoopessence.jar org.techmytalk.hadoopessence.GroupMR /input /output

For testing we can also run the application directly in Eclipse using Run As.


Summary: In this article I explained a basic MapReduce program that computes a group-by operation.

Reference: Hadoop Essence: The Beginner’s Guide to Hadoop & Hive

JAVA NIO – FileLock

File locking was introduced with the java.nio package and is implemented by FileChannel. A FileLock comes in two types: exclusive and shared. A shared lock allows other concurrently running programs to acquire overlapping shared locks, whereas an exclusive lock does not allow other programs to acquire any overlapping lock.

The API ideally supports shared locks, but this depends on the operating system: if the OS does not support shared locks, a shared lock request is treated as an exclusive lock. Locking is applied per file, not per channel or thread, so if one channel obtains an exclusive lock, a channel in another JVM will block until the first channel releases the lock.

Because locks are associated with a file, not with individual file handles or channels, they are suitable for coordinating access between processes; it is not advisable to use them to coordinate threads within a single JVM.

Below is the FileLock-related API in the FileChannel class.

public abstract class FileChannel extends AbstractInterruptibleChannel implements SeekableByteChannel,
GatheringByteChannel, ScatteringByteChannel {
public abstract FileLock lock(long position, long size, boolean shared) throws IOException;
public final FileLock lock() throws IOException {
return lock(0L, Long.MAX_VALUE, false);
}
public abstract FileLock tryLock(long position, long size, boolean shared) throws IOException;
public final FileLock tryLock() throws IOException {
return tryLock(0L, Long.MAX_VALUE, false);
}
}

The lock(long position, long size, boolean shared) method acquires a lock on the specified part of a file. The region is defined by its starting position and size, and the last argument specifies whether the lock is shared (true) or exclusive (false). To obtain a shared lock, the file must be open for reading, whereas an exclusive lock requires the file to be open for writing.

We can use the no-argument lock() method to acquire a lock on the whole file rather than on a specified region. Because a file can grow, a region size must be specified when obtaining a regional lock; the convenience lock() method simply locks from position 0 to Long.MAX_VALUE.

The tryLock method, on a specified region or on the whole file, does not block while acquiring a lock: it returns immediately instead of waiting. If another process already holds an overlapping lock on the file, it immediately returns null.
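
A minimal sketch of the difference, assuming a file named data.txt; the file name and class name here are illustrative only:

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class TryLockSample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("data.txt", "rw");
             FileChannel channel = file.getChannel()) {
            // Non-blocking attempt: returns null if another process holds an overlapping lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                System.out.println("data.txt is locked by another process");
            } else {
                try {
                    // ... work with the exclusively locked file ...
                } finally {
                    lock.release();
                }
            }
        }
    }
}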

The FileLock API is shown below.

public abstract class FileLock {
    public final FileChannel channel( );
    public final long position( );
    public final long size( );
    public final boolean isShared( );
    public final boolean overlaps(long position, long size);
    public abstract boolean isValid( );
    public abstract void release( ) throws IOException;
}

A FileLock encapsulates the specific region acquired through a FileChannel. Each FileLock is associated with a particular FileChannel instance and keeps a reference to it, which can be retrieved with the channel() method.

The FileLock lifecycle starts when lock() or tryLock() is invoked on the FileChannel and ends when release() is called. A FileLock also becomes invalid if the JVM shuts down or the associated channel is closed.

The isShared() method returns whether the lock is shared or exclusive. If shared locks are not supported by the operating system, this method always returns false.

FileLock objects are thread-safe; multiple threads may access a lock object concurrently.

Finally, overlaps(long position, long size) returns whether the lock overlaps the given region.

A FileLock is tied to an underlying file, and failing to release it can block or deadlock other processes waiting on that file, so it is advisable to always release the lock in a finally block, as shown below:

FileLock lock = fileChannel.lock();
try {
    // ... work with the locked file ...
} catch (IOException e) {
    // ... handle the exception ...
} finally {
    lock.release();
}
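
Since FileLock also implements AutoCloseable (its close() method releases the lock), the same pattern can be written with try-with-resources on Java 7 and later; this is an alternative formulation rather than part of the original example:

try (FileLock lock = fileChannel.lock()) {
    // ... work with the locked file; the lock is released automatically ...
} catch (IOException e) {
    // ... handle the exception ...
}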

JAVA NIO – Memory-Mapped File

In earlier Java versions, file access went through the conventional filesystem API. Behind the scenes the JVM makes read() and write() system calls to transfer data from the OS kernel into the JVM. The JVM uses its own memory space to load and process the file, which becomes a problem when processing large files. File pages must be converted for the JVM before the data can be processed, because the OS handles files in pages while the JVM works with byte streams, which are not page compatible. From JDK 1.4 onward, Java provides MappedByteBuffer, which establishes a virtual-memory mapping from JVM space to filesystem pages. This removes the overhead of transferring and copying the file's content from OS kernel space into JVM space. The OS uses virtual memory to cache the file outside of kernel space, where it can be shared with other non-kernel processes. Java maps the file pages directly to a MappedByteBuffer and processes the file without loading it into the JVM heap.

The diagram below shows how the JVM maps the file with MappedByteBuffer and processes it without loading the file into the JVM.

[Figure: JAVA NIO – JVM accessing a file through MappedByteBuffer]

A MappedByteBuffer maps directly to the open file in virtual memory via the map() method of FileChannel. The MappedByteBuffer object works like a buffer, but its data is backed by a file in virtual memory. The get() method on a MappedByteBuffer fetches data from the file, reflecting the current file content on disk. Similarly, the put() method updates the content directly on disk, and the modified content is visible to other readers of the file. Processing a file through MappedByteBuffer has a big advantage: it avoids making a system call for each read or write on the file, which improves latency. In addition, the file pages cached in virtual memory are accessed directly by the MappedByteBuffer and do not consume JVM heap space. The main drawback is that a page fault occurs whenever a requested page is not in memory.

Below are some of the key benefits of MappedByteBuffer:

  1. The JVM operates directly on virtual memory, avoiding read() and write() system calls.
  2. The JVM does not load the file into its own memory; it uses virtual memory instead, which makes it possible to process large files efficiently.
  3. The OS mostly takes care of reading and writing the shared virtual memory without involving the JVM.
  4. The mapped file can be used by more than one process, based on locking provided by the OS (locking is discussed separately).
  5. It is possible to map only a region or part of a file.
  6. The mapped data always reflects the file data on disk, without explicit buffer transfers.


Note: when the JVM touches the shared virtual memory, a page fault is generated that brings the file data from the OS into shared memory. Another benefit of memory mapping is that the operating system manages the virtual-memory space automatically.


public abstract class FileChannel
        extends AbstractInterruptibleChannel
        implements SeekableByteChannel, GatheringByteChannel, ScatteringByteChannel {

    public abstract MappedByteBuffer map(MapMode mode,
            long position, long size) throws IOException;

    public static class MapMode {
        public static final MapMode READ_ONLY;
        public static final MapMode READ_WRITE;
        public static final MapMode PRIVATE;
    }
}

As shown above, the map() method on FileChannel takes a mode, a position, and a size, and returns a MappedByteBuffer for that part of the file. We can also map the entire file as below:

buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());

If you request a size larger than the file size in write mode, the file grows to match the size of the MappedByteBuffer; in read mode this throws an IOException because the file cannot be modified. There are three mapping modes for the map() method:

  • READ_ONLY: read only
  • READ_WRITE: read and update the file
  • PRIVATE: changes made to the buffer are not written back to the file (copy-on-write)

With PRIVATE mapping, modifications are also not visible to other programs that have mapped the same file; instead, private copies of the modified portions of the buffer are created.

The map() method throws NonWritableChannelException if READ_WRITE or PRIVATE mode is requested on a channel that was not opened for writing, and NonReadableChannelException if the channel was not opened for reading.


Below is sample code showing how to use MappedByteBuffer.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MemoryMappedSample {
    public static void main(String[] args) throws Exception {
        int count = 10;
        try (RandomAccessFile memoryMappedFile = new RandomAccessFile("bigFile.txt", "rw")) {
            // Map the first 'count' bytes of the file into memory.
            MappedByteBuffer out = memoryMappedFile.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, count);
            // Writes go directly to the mapped file region.
            for (int i = 0; i < count; i++) {
                out.put((byte) 'A');
            }
            // Reads come directly from the mapped file region.
            for (int i = 0; i < count; i++) {
                System.out.print((char) out.get(i));
            }
        }
    }
}

JAVA NIO – Channel

A channel transfers I/O data between ByteBuffers and an I/O source or target. We put data into a buffer and hand it to the channel, or the channel reads data and pushes it back into a buffer.

Channels provide a direct connection to the I/O layer to transport data between OS byte buffers and files or sockets. A channel uses buffers to send and receive data in a form compatible with the OS's own buffers, minimizing the overhead of accessing the OS filesystem.

A channel represents an open connection to an I/O resource, such as a network socket or a file, for read and write operations. A channel cannot be constructed directly; it is obtained by calling getChannel() on a RandomAccessFile, FileInputStream, or FileOutputStream object (or an open() factory method on socket channels). The Channel interface provides a close() method, and once a channel is closed it cannot be reopened; further operations throw ClosedChannelException.

Below is the Channel interface

public interface Channel extends Closeable {
public boolean isOpen();
public void close() throws IOException;
}

Channel implementations depend on native calls provided by the operating system. The interface provides isOpen() to check whether the channel is open and close() to close an open channel.

The diagram below shows the overall package structure.

[Figure: JAVA NIO channel interfaces and package structure]

The InterruptibleChannel interface is a marker for channels that can be asynchronously closed and interrupted: if a thread is blocked in an I/O operation on an interruptible channel, another thread may invoke the channel's close method asynchronously.

The WritableByteChannel interface extends Channel and provides a write(ByteBuffer src) method that writes a sequence of bytes to the channel from the given buffer.

The ReadableByteChannel interface has a read(ByteBuffer) method that reads a sequence of bytes from the channel into the buffer passed as an argument. Only one thread can run a read operation at a time; if another thread initiates a read, it blocks until the first operation completes.

NIO provides concrete implementations of these interfaces in the SPI package, which can be replaced by different providers. A SelectableChannel uses a Selector to multiplex I/O across channels. The java.nio.channels.spi package provides base implementations of the various channels; for instance, AbstractInterruptibleChannel and AbstractSelectableChannel provide the methods needed by channel implementations that are interruptible or selectable, respectively.

Unlike an InputStream or OutputStream, which each work in one direction, a channel can communicate with I/O in both directions. It can handle concurrent writes and reads, meaning a read does not block a write and vice versa.
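
As a small illustration of reading through a channel into a buffer (the file name data.txt and class name are just examples):

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;

public class ChannelReadSample {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("data.txt");
             ReadableByteChannel channel = in.getChannel()) {
            ByteBuffer buffer = ByteBuffer.allocate(64);
            // read() fills the buffer from the channel and returns the byte count, or -1 at end of stream.
            while (channel.read(buffer) != -1) {
                buffer.flip();                  // switch from filling to draining
                while (buffer.hasRemaining()) {
                    System.out.print((char) buffer.get());
                }
                buffer.clear();                 // make the buffer ready for the next read
            }
        }
    }
}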

Channel Creation

There are two broad kinds of channel: file channels and socket channels. Socket channels come in three types: SocketChannel, ServerSocketChannel, and DatagramChannel. Channels can be created in several ways: socket channels have an open() factory method for creating new channels, whereas a FileChannel is obtained by calling getChannel() on an I/O stream or a RandomAccessFile.

SocketChannel socketChannel = SocketChannel.open();
socketChannel.connect(new InetSocketAddress("host", someport));

ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
serverSocketChannel.socket().bind(new InetSocketAddress(somelocalport));

DatagramChannel datagramChannel = DatagramChannel.open();

RandomAccessFile randaccfile = new RandomAccessFile("file", "r");
FileChannel fileChannel = randaccfile.getChannel();

Channels can be unidirectional or bidirectional: a class that implements both ReadableByteChannel and WritableByteChannel is a bidirectional channel, whereas a class that implements only one of them is unidirectional.

Scatter/Gather

Scatter/gather (vectored I/O) provides the capability to perform a single I/O operation across multiple buffers. A gathering write collects data from multiple buffers and sends it to the channel as one stream, while a scattering read separates the data read from the channel into multiple buffers. For example, on a gathering write, the data is gathered from several buffers and sent to the channel in one call.

Internally, these operations can map to native OS calls that fill or drain the buffers directly, avoiding extra copying.

public interface ScatteringByteChannel extends ReadableByteChannel {
    public long read(ByteBuffer[] dsts, int offset, int length) throws IOException;
    public long read(ByteBuffer[] dsts) throws IOException;
}

public interface GatheringByteChannel extends WritableByteChannel {
    public long write(ByteBuffer[] srcs) throws IOException;
}

public interface WritableByteChannel extends Channel {
    public int write(ByteBuffer src) throws IOException;
}
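
A short sketch of a gathering write, assuming an output file out.txt; the header/body split and names are purely illustrative:

import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;
import java.nio.charset.StandardCharsets;

public class GatherWriteSample {
    public static void main(String[] args) throws Exception {
        ByteBuffer header = ByteBuffer.wrap("HEADER\n".getBytes(StandardCharsets.UTF_8));
        ByteBuffer body = ByteBuffer.wrap("body of the message\n".getBytes(StandardCharsets.UTF_8));
        try (FileOutputStream out = new FileOutputStream("out.txt")) {
            GatheringByteChannel channel = out.getChannel();
            // A single gathering write drains both buffers, in order, into the channel.
            channel.write(new ByteBuffer[] { header, body });
        }
    }
}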

JAVA NIO – Buffer

A Buffer is an abstract class in java.nio that holds a fixed amount of data. Data can be stored in a buffer and later retrieved from it. A buffer has three key properties:

  • Capacity: the number of elements the buffer contains. It is fixed and never changes.
  • Limit: the index of the first element that should not be read or written. It is never greater than the buffer's capacity.
  • Position: the index of the next element to be read or written.

There is one subclass of Buffer for each non-boolean primitive type, e.g. CharBuffer, IntBuffer, and so on. These subclasses share the common behavior defined by the abstract Buffer class, as shown below.

[Figure: JAVA NIO buffer class hierarchy]

A buffer is a linear, finite sequence of elements of a specific primitive type wrapped inside an object; it combines the data content and information about the data in a single object. Data may be transferred into or out of the buffer by a channel's I/O operations at the current position: the operation reads or writes at the current position and then increments the position by the number of elements transferred. A buffer throws BufferUnderflowException or BufferOverflowException if the position would exceed the limit during a get or put operation, respectively. The position field specifies where the next data element is retrieved from or inserted into the buffer. The limit field indicates the end of the usable portion of the buffer and can be set with the operation below.

public final Buffer limit(int newLimit)

We can also prepare the buffer for draining by using the flip method:

public final Buffer flip()

The flip() method sets the limit to the current position and the position back to 0. The rewind() method is similar to flip() but does not affect the limit: it only sets the position back to 0. You can use rewind() to go back and reread the data in a buffer that has already been flipped.
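
A small sketch of the fill, flip, and drain cycle on a ByteBuffer:

import java.nio.ByteBuffer;

public class BufferFlipSample {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);    // capacity 16, position 0, limit 16

        // Fill: each put() advances the position.
        buffer.put((byte) 'N').put((byte) 'I').put((byte) 'O');

        // Flip: limit becomes the current position (3), position goes back to 0.
        buffer.flip();
        while (buffer.hasRemaining()) {
            System.out.print((char) buffer.get());      // drains: prints NIO
        }

        // Rewind: position back to 0, limit unchanged, so the data can be reread.
        buffer.rewind();
        System.out.println(" / again: " + (char) buffer.get());
    }
}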

Byte Buffers:

A ByteBuffer corresponds closely to the buffers the operating system works with, so it can be mapped to an OS buffer without any translation. ByteBuffer uses the byte as its core unit for reading and writing I/O data, which makes it more significant than the other primitive-type buffers.

Byte buffers can be created either by allocation, which allocates space for the buffer’s content, or by wrapping an existing byte array into a buffer.


Direct & Indirect Buffers

ByteBuffers come in two types: direct and indirect. The key difference is that a direct buffer can be used directly by native I/O calls, whereas an indirect buffer cannot. With a direct buffer, native I/O operations fill or drain the buffer directly from the file data. Direct buffers consume more memory to set up, but in return they provide the most efficient I/O mechanism: the content does not need to be copied to or from an intermediate buffer.

An indirect buffer can also be used to transfer data, but native I/O operations cannot act on it directly; instead, a temporary direct buffer is used behind the scenes to access the file I/O data.

A direct buffer is created with the allocateDirect() factory method. Creating one is more expensive, so it is advisable to use direct buffers only when they yield a measurable gain in program performance.

Whether a byte buffer is direct or non-direct can be determined by invoking its isDirect() method. This method is provided so that explicit buffer management can be done in performance-critical code.
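
A minimal sketch contrasting the two allocation styles:

import java.nio.ByteBuffer;

public class DirectBufferSample {
    public static void main(String[] args) {
        // Indirect (heap) buffer: backed by a Java byte[] on the heap.
        ByteBuffer heapBuffer = ByteBuffer.allocate(1024);

        // Direct buffer: allocated outside the Java heap so native I/O can use it directly.
        ByteBuffer directBuffer = ByteBuffer.allocateDirect(1024);

        System.out.println("heapBuffer.isDirect()   = " + heapBuffer.isDirect());    // false
        System.out.println("directBuffer.isDirect() = " + directBuffer.isDirect());  // true
    }
}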

JAVA NIO – Introduction

The operating system allocates memory to the JVM to do its work. Before JDK 1.4, the JVM used the conventional filesystem API to access files on disk, which was a burden because there is no direct reference between the JVM and a file on disk; the JVM relies on operating-system system calls to access the file. The operating system stores file data in large byte buffers (pages), which are quite different from the byte streams used by the JVM, so the JVM spends extra effort converting between the OS buffers and byte streams.

So there were two key challenges with the old approach:

  • The JVM could not access file data directly from disk
  • It had to spend extra effort converting file data to byte streams

The key steps to access a file were the following:

  1. The operating system allocates memory to the JVM
  2. The client invokes the filesystem API to access a specific file
  3. The JVM makes an OS system call to access the file data
  4. The JVM receives the OS data as a byte buffer and converts it into a byte stream

For large files the OS uses virtual memory to store data outside of RAM. The benefit of virtual memory is that it is sharable across multiple processes, so it can be accessed by both the OS and the JVM. Transferring data from the OS to virtual memory can be quite fast using DMA (Direct Memory Access), whereas transferring data from virtual memory to the JVM is slow because the JVM spends extra effort breaking the large data buffer into a byte stream.

JAVA FileSystem: Java uses the FileSystem API to access physical storage in the system. When a client tries to access a particular file via the FileSystem, the FileSystem identifies the storage location and loads those disk blocks into memory.

A file's data is stored in multiple pages, and each page contains a group of blocks. The kernel establishes a mapping between memory pages and filesystem pages.

Virtual memory reads the paged content from disk and uses page faults to bring file data into virtual memory. Once the page-ins complete, the filesystem reads the file contents and its meta information.

JAVA NIO

Java NIO, introduced in JDK 1.4 and enhanced in later versions, improves I/O operations. It provides new buffer types such as ByteBuffer, CharBuffer, and IntBuffer that reduce the overhead of transferring data from the OS to the JVM. Using the new ByteBuffer, Java NIO can map data directly from OS virtual memory to the JVM. A Java ByteBuffer has the same layout as an OS byte buffer, so it can be mapped from the OS buffer to the JVM buffer with little conversion overhead.

[Figure: JAVA NIO – JVM mapping file data from virtual memory]

As the diagram above shows, when the OS uses virtual memory and the JVM uses the new NIO interface, performance improves while processing files, especially large ones. Later we will also discuss MappedByteBuffer, which can process a file directly in virtual memory without transferring data from virtual memory to the JVM.