2022-09-02 09:26:29

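Reading files

The driver can read files on the client's local disk while the client submits a MapReduce job. Once the job is running, the map and reduce tasks are distributed to different nodes and can no longer see the client's local files; a task can only access the local files of the node it runs on. To make a file available to the tasks, either read it from HDFS or use Hadoop's DistributedCache, which takes a file from shared storage and copies it to the nodes that run the tasks.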

Reading a local file in main

Use FileSystem.getLocal(conf).
Alternatively, put the file to HDFS in main first and read it from there inside the job.

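import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

// In main: FILE_MAINCONFIG is the configuration key that holds the local file path.
// The file is assumed to be UTF-8 encoded.
LocalFileSystem fs = FileSystem.getLocal(conf);
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        fs.open(new Path(conf.get(FILE_MAINCONFIG))), StandardCharsets.UTF_8))) {
    String tmp;
    while ((tmp = reader.readLine()) != null) {
        // Process each line
        System.out.println(tmp);
    }
} catch (IOException e) {
    e.printStackTrace();
    System.exit(1);
}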

Reading an HDFS file in setup

Use FileSystem.get(conf).

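import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// In setup: conf is the job configuration, e.g. context.getConfiguration().
String hdfspath = conf.get(Driver.FILE_HDFSCONFIG);
// Close only the reader; FileSystem.get(conf) returns a shared, cached instance.
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        FileSystem.get(conf).open(new Path(hdfspath)), StandardCharsets.UTF_8))) {
    String tmp;
    while ((tmp = reader.readLine()) != null) {
        // Process each line
        System.out.println(tmp);
    }
} catch (IOException e) {
    e.printStackTrace();
    System.exit(1);
}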
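
The "process each line" step above is often used to load a small lookup table once per task. The following is a minimal sketch of that idea, assuming the file holds one entry per line. LineSetLoader and load are hypothetical names, and the method takes a plain InputStream so that it also accepts the stream returned by FileSystem.open.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: collects the non-empty, trimmed lines of a stream into a set.
final class LineSetLoader {
    static Set<String> load(InputStream in) throws IOException {
        Set<String> entries = new HashSet<>();
        // try-with-resources closes the reader (and the wrapped stream) on exit
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    entries.add(trimmed);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the HDFS file here.
        InputStream in = new ByteArrayInputStream(
                "alpha\n\n beta \n".getBytes(StandardCharsets.UTF_8));
        System.out.println(load(in).contains("beta")); // prints true
    }
}
```

In setup this could be called as LineSetLoader.load(FileSystem.get(conf).open(new Path(hdfspath))), keeping the resulting set in a field for use in map.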

DistributedCache

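Hadoop offers two ways to use DistributedCache: through the API, by setting the file paths in the program, or through command-line options that tell Hadoop which files to distribute. The command line accepts the following three options:
-files: distributes the specified local or HDFS files to the working directory of each task, without processing them in any way;
-archives: distributes the specified files to the working directory of each task and automatically unpacks files whose names end in ".jar", ".zip", ".tar.gz" or ".tgz". By default the unpacked content goes into a directory, inside the working directory, that is named after the original file; for example, the archive dict.zip is unpacked into a directory named dict.zip. To change this, give the file an alias (symlink) such as dict.zip#dict, and the archive is unpacked into the directory dict instead.
-libjars: specifies jar files to distribute; after copying them to each node, Hadoop automatically adds them to the task's CLASSPATH.
The -files option of Hadoop Streaming works this way.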
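
The "#" alias in dict.zip#dict is the fragment part of a URI. The sketch below shows how such a name can be derived with java.net.URI. CacheLinkName and linkName are hypothetical names, and the fallback to the last path segment only mirrors the default described above; it is not Hadoop's own code.

```java
import java.net.URI;

// Hypothetical helper: works out the directory/link name for a cache entry.
final class CacheLinkName {
    static String linkName(URI uri) {
        // An explicit alias is carried in the URI fragment (the part after '#').
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // No alias: fall back to the last path segment, i.e. the file name.
        String path = uri.getPath();
        if (path == null) {
            return uri.toString(); // opaque URI, nothing sensible to extract
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(linkName(URI.create("/user/yourname/cache/dict.zip#dict"))); // prints dict
        System.out.println(linkName(URI.create("/user/yourname/cache/dict.zip")));      // prints dict.zip
    }
}
```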

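// in main:
Job job = Job.getInstance(); // new Job() is deprecated
// filename apparently has to be a file on HDFS, which is then distributed to each compute node
job.addCacheFile(new Path(filename).toUri());
// in mapper code (getLocalCacheFiles() is deprecated in Hadoop 2.x):
Path[] localPaths = context.getLocalCacheFiles();
...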

//In your driver, use the Job.addCacheFile()
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "MyJob");
    job.setMapperClass(MyMapper.class);
    // ...
   
    job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
    job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));
    return job.waitForCompletion(true) ? 0 : 1;
}
//And in your Mapper/Reducer, override the setup(Context context) method:
@Override
protected void setup(
        Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {
        File some_file = new File("./some");
        File other_file = new File("./other");
        // Do things to these two files, like read them
    }
    super.setup(context);
}

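Reading paths

Setting input paths in MapReduce

Input paths on HDFS support wildcard (glob) patterns by default, so when setting the input path of a MapReduce job you can write the pattern directly in the path. Note that these are glob patterns such as * and {a,b}, not full regular expressions.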
FileInputFormat.setInputDirRecursive(job, true); // read directories recursively
FileInputFormat.addInputPath(job, new Path("path1")); // adds one path per call
FileInputFormat.addInputPaths(job, "path1,path2,path3,path…"); // adds several paths at once
FileInputFormat.setInputPaths(job, new Path("path1"), new Path("path2")); // replaces the existing paths
FileInputFormat.setInputPaths(job, "path1,path2,path3,path…"); // replaces the existing paths
FileSystem.globStatus: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
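
Because several paths are passed as one comma-separated string, a glob alternation such as {2016,2017} contains a comma that should not act as a separator. The sketch below shows one way to split such a string while keeping brace groups intact. InputPathSplitter is a hypothetical name and the paths are made up; this is an illustration, not Hadoop's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on commas that are not inside {...} groups.
final class InputPathSplitter {
    static List<String> split(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        int depth = 0; // current nesting level of curly braces
        int start = 0; // index where the current path begins
        for (int i = 0; i < commaSeparatedPaths.length(); i++) {
            char ch = commaSeparatedPaths.charAt(i);
            if (ch == '{') {
                depth++;
            } else if (ch == '}' && depth > 0) {
                depth--;
            } else if (ch == ',' && depth == 0) {
                paths.add(commaSeparatedPaths.substring(start, i));
                start = i + 1;
            }
        }
        paths.add(commaSeparatedPaths.substring(start)); // last path
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(split("/logs/2015*/pv,/logs/{2016,2017}/pv"));
        // prints [/logs/2015*/pv, /logs/{2016,2017}/pv]
    }
}
```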

Hadoop's FileSystem.globStatus method supports wildcards in paths and returns the matches as FileStatus objects

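// fs is a FileSystem instance, e.g. FileSystem.get(conf)
String basepath = "/user/d1/DataFileShare/Search/2015*/*/pv";
// Expand the glob into the matching paths
FileStatus[] status = fs.globStatus(new Path(basepath));
// globStatus can return null when a path without wildcards does not exist
if (status != null) {
    for (FileStatus f : status) {
        // Print the full path
        System.out.println(f.getPath().toString());
        // Print the last path component
        // System.out.println(f.getPath().getName());
    }
}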
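
To see why a wildcard such as 2015* is a glob and not a regular expression, the two can be compared on plain directory names. This sketch uses the JDK's own PathMatcher glob support as a stand-in; its syntax is close to Hadoop's glob syntax but should not be assumed identical, and the directory names are made up.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Compares glob matching with regex matching on a single name component.
final class GlobVsRegex {
    static boolean globMatches(String glob, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    static boolean regexMatches(String regex, String name) {
        // Pattern.matches requires the whole name to match the expression.
        return Pattern.matches(regex, name);
    }

    public static void main(String[] args) {
        // Glob: "2015" followed by any characters.
        System.out.println(globMatches("2015*", "20150901"));   // prints true
        // Regex: "201" followed by zero or more '5' characters.
        System.out.println(regexMatches("2015*", "20150901"));  // prints false
        // The regex equivalent of the glob needs ".*".
        System.out.println(regexMatches("2015.*", "20150901")); // prints true
    }
}
```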

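globStatus also has an overload that takes a PathFilter, which filters the matched paths a second time with custom logic such as a regular expression. This gives more flexible control over which paths are used.

For simple filtering, use wildcards directly in the path;
for more complex cases, implement a custom PathFilter to filter the paths.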
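
For the custom-filter case, the accept/reject decision usually comes down to a test on the file name. The sketch below keeps that test in plain Java so it runs without a cluster; in a real job the same check would sit inside the accept(Path) method of a PathFilter and be called with path.getName(). The class name NameFilter and the rule itself (skip names starting with "_" or ".", then require a regex match) are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical name-based rule that a PathFilter.accept(Path) could delegate to.
final class NameFilter {
    private final Pattern wanted;

    NameFilter(String regex) {
        this.wanted = Pattern.compile(regex);
    }

    boolean accept(String fileName) {
        // Skip marker and hidden files such as _SUCCESS or .crc files.
        if (fileName.startsWith("_") || fileName.startsWith(".")) {
            return false;
        }
        // Keep only names that match the whole regular expression.
        return wanted.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        NameFilter filter = new NameFilter("part-\\d+");
        System.out.println(filter.accept("part-00000"));     // prints true
        System.out.println(filter.accept("_SUCCESS"));       // prints false
        System.out.println(filter.accept("part-00000.tmp")); // prints false
    }
}
```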