Wednesday, August 11, 2010

Accessing Local Files

In most examples of Hadoop code there is no reason to access the local filesystem. All of the data is passed through the standard map and reduce methods. Indeed, it is usually a bad idea to access the local filesystem on a slave processor, because that data will not be persisted from one processing step to the next. Sometimes, however, these rules have to bend.

One case where access to the local filesystem is genuinely necessary is when a critical step in either the mapper or the reducer launches a separate process that assumes the existence of certain local files. Normally, when Hadoop drives an external process it uses Hadoop streaming, which assumes that the external process reads all of its data from standard input and writes all of its output to standard output. These assumptions can fail in several ways. First, the external process may require more than one input; for example, one or more configuration files may be required. Second, streaming assumes that the developer has sufficient control over the external process and the way it functions to make it compatible with this stdin/stdout model.

In many cases these assumptions are unrealistic. Nothing prevents a custom mapper or reducer from writing the appropriate files to the local filesystem for an external program, launching that program, waiting for it to finish, and then reading any output files it has written, as sketched below.
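The pattern looks roughly like the following. This is a minimal sketch rather than production code: the tool name mytool, its arguments, and the file names input.dat and output.dat are hypothetical placeholders for whatever external program you need to drive.

import java.io.*;

/**
* run an external program that communicates through local files
* "mytool", "input.dat" and "output.dat" are hypothetical placeholders
*/
public static void runExternalTool(byte[] dataForTool) throws IOException, InterruptedException {
    File workDir = new File(".");   // relative - see the note on relative paths below
    // 1. write the data the external program expects to find
    FileOutputStream out = new FileOutputStream(new File(workDir, "input.dat"));
    out.write(dataForTool);
    out.close();

    // 2. launch the program and wait for it to finish
    ProcessBuilder builder = new ProcessBuilder("mytool", "input.dat", "output.dat");
    builder.directory(workDir);
    Process proc = builder.start();
    int exitCode = proc.waitFor();
    if (exitCode != 0)
        throw new IllegalStateException("mytool failed with exit code " + exitCode);

    // 3. read back whatever the program wrote
    BufferedReader rdr = new BufferedReader(new FileReader(new File(workDir, "output.dat")));
    for (String line = rdr.readLine(); line != null; line = rdr.readLine()) {
        // hand each line back to the mapper or reducer as appropriate
    }
    rdr.close();
}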

There are two ways to get files onto the local filesystem of a slave processor. One is to use Hadoop's distributed cache, which ships files specified in the job configuration to each slave's local filesystem; the distributed cache will be the topic of another blog entry. The alternative, which this entry concentrates on, is to have the slave process write the files directly. Files required during all steps of processing may be written to the local filesystem during the setup phase. Files required for only a single step of processing may be written during that step and, if no longer needed, deleted at the end of that step, as in the sketch below.
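As a sketch of that lifecycle, a mapper might write a scratch file once in setup and remove it in cleanup. This assumes the org.apache.hadoop.mapreduce API; the file name scratch.cfg and the configuration key myjob.tool.config are hypothetical.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocalScratchMapper extends Mapper<LongWritable, Text, Text, Text> {
    // "scratch.cfg" is a hypothetical per-task file name
    private final File scratchFile = new File("scratch.cfg");

    @Override
    protected void setup(Context context) throws IOException {
        // written once during setup, available to every call to map()
        FileWriter writer = new FileWriter(scratchFile);
        writer.write(context.getConfiguration().get("myjob.tool.config", ""));
        writer.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use scratchFile here, for example hand it to an external program ...
    }

    @Override
    protected void cleanup(Context context) {
        scratchFile.delete();   // no longer needed once this task is done
    }
}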

Hadoop supplies a LocalFileSystem object which manages the relationship to the local filesystem. The code below shows how to get a LocalFileSystem given a Hadoop context.

// inside a Mapper or Reducer - the Context supplies the job Configuration
Configuration configuration = context.getConfiguration();
LocalFileSystem localFs = FileSystem.getLocal(configuration);   // may throw IOException

The LocalFileSystem has methods to create, delete, open, and append to files on the local filesystem. Each file is designated by a Path. In my work I have made these Paths relative, since I am uncertain about where a program is running or what permissions are available on a slave processor.
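For example, with the localFs handle from above (the file name data.txt is a placeholder; a relative Path resolves against the task's working directory):

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

Path relative = new Path("data.txt");   // "data.txt" is a hypothetical file name

FSDataOutputStream out = localFs.create(relative);   // create, overwriting any existing file
out.writeBytes("some local data\n");
out.close();

FSDataInputStream in = localFs.open(relative);       // open for reading
// ... read the contents ...
in.close();

localFs.delete(relative, false);                     // false = do not recurse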

 

The following code is a set of static utility routines that write to the local filesystem. I consider three cases. In the first, the data is a String, possibly the contents of a Text object passed in. In the second, the contents are a resource packaged in a custom jar file; resources are very convenient when data must be passed to every instance and the data is not large relative to the size of the jar. Both of these end up calling a routine which writes the contents of an InputStream to the local filesystem. This allows a third possibility, where the data source is anything that can supply an InputStream, most notably Web services and other remote data sources.

/**
* write a resource to a LocalFileSystem
* @param cls - class holding the resource
* @param resourceName - !null name of the resource
* @param localFs - !null file system
* @param dstFile - !null local file name - this will become a path
*/
public static void writeResourceAsFile(Class<?> cls, String resourceName, LocalFileSystem localFs, String dstFile) {
    InputStream inp = cls.getResourceAsStream(resourceName);
    if (inp == null)    // getResourceAsStream returns null when the resource is missing
        throw new IllegalArgumentException("no resource named " + resourceName);
    writeStreamAsFile(localFs, dstFile, inp);
}

/**
* Write the contents of a stream to the local file system
* @param localFs - !null file system
* @param dstFile - !null local file name - this will become a path
* @param pInp - !null open Stream
*/
public static void writeStreamAsFile(final LocalFileSystem localFs, final String dstFile, final InputStream pInp) {
    Path path = new Path(dstFile);
    try {
        FSDataOutputStream outStream = localFs.create(path);
        copyFile(pInp, outStream);   // copyFile closes both streams
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }
}

/**
* Write the contents of a String to the local file system
* @param localFs - !null file system
* @param dstFile - !null local file name - this will become a path
* @param s - !null String
*/
public static void writeStringAsFile(final LocalFileSystem localFs, final String dstFile, final String s) {
    // note: getBytes() uses the platform default charset
    ByteArrayInputStream inp = new ByteArrayInputStream(s.getBytes());
    writeStreamAsFile(localFs, dstFile, inp);
}

/**
* copy an InputStream to an outStream
* @param inp - !null open Stream; it will be closed at the end
* @param outStream - !null open Stream; it will be closed at the end
* @return true on success
*/
public static boolean copyFile(InputStream inp, FSDataOutputStream outStream) {
    int bufsize = 1024;
    try {
        byte[] buffer = new byte[bufsize];
        int bytesRead;
        while ((bytesRead = inp.read(buffer, 0, bufsize)) != -1) {
            outStream.write(buffer, 0, bytesRead);
        }
        return true;
    }
    catch (IOException ex) {
        return false;
    }
    finally {
        // close both streams whether or not the copy succeeded
        try { inp.close(); } catch (IOException ignored) {}
        try { outStream.close(); } catch (IOException ignored) {}
    }
}
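Putting the pieces together, a mapper's setup method might stage its files with these routines. This is a sketch: the resource name /params.cfg and the configuration key myjob.properties are hypothetical.

@Override
protected void setup(Context context) {
    Configuration configuration = context.getConfiguration();
    LocalFileSystem localFs;
    try {
        localFs = FileSystem.getLocal(configuration);
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }

    // a resource packaged in the job jar - "/params.cfg" is hypothetical
    writeResourceAsFile(getClass(), "/params.cfg", localFs, "params.cfg");

    // a string taken from the job configuration - the key name is hypothetical
    writeStringAsFile(localFs, "job.properties",
            configuration.get("myjob.properties", ""));
}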
