Default Namenode port of HDFS is 50070, but in some places I have come across 8020 or 9000 [closed]

The default Hadoop ports are as follows (these are HTTP ports; they serve a web UI):

Daemon                  Default Port   Configuration Parameter
----------------------  ------------   --------------------------------
Namenode                50070          dfs.http.address
Datanodes               50075          dfs.datanode.http.address
Secondarynamenode       50090          dfs.secondary.http.address
Backup/Checkpoint node  50105          dfs.backup.http.address
Jobtracker              50030          mapred.job.tracker.http.address
Tasktrackers            50060          mapred.task.tracker.http.address

Internally, Hadoop mostly uses Hadoop IPC (Inter-Process Communication) to communicate amongst … Read more
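As a quick way to check which values are actually in effect on a given cluster, here is a minimal Scala sketch. It assumes the Hadoop 1.x property names from the table above and hadoop-core on the classpath; PortCheck is a hypothetical name, not part of the answer:

    import org.apache.hadoop.conf.Configuration

    object PortCheck {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath, if present.
        val conf = new Configuration()
        // Each get() falls back to the stock default when no *-site.xml overrides it.
        println(conf.get("dfs.http.address", "0.0.0.0:50070"))                // Namenode web UI
        println(conf.get("dfs.datanode.http.address", "0.0.0.0:50075"))       // Datanode web UI
        println(conf.get("mapred.job.tracker.http.address", "0.0.0.0:50030")) // Jobtracker web UI
        // The 8020/9000 ports from the question title belong to the Namenode's
        // IPC (RPC) address, configured separately from the HTTP ports above:
        println(conf.get("fs.default.name", "file:///"))
      }
    }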

Spark iterate HDFS directory

You can use org.apache.hadoop.fs.FileSystem; specifically, FileSystem.listFiles([path], true). And with Spark: FileSystem.get(sc.hadoopConfiguration).listFiles(…, true). Edit: it's worth noting that good practice is to get the FileSystem associated with the Path's scheme: path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
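Put together, a runnable Scala sketch of the recursive listing, assuming an existing SparkContext sc (as in spark-shell) and a made-up example path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = new Path("hdfs:///some/dir")              // hypothetical example path
    // Resolve the FileSystem from the Path's own scheme (the good practice noted above).
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    val files = fs.listFiles(path, true)                 // true = recurse into subdirectories
    while (files.hasNext) {
      val status = files.next()                          // a LocatedFileStatus
      println(status.getPath)
    }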

Namenode not getting started

I was facing the issue of the namenode not starting. I found a solution using the following steps: first, delete all contents from the temporary folder: rm -Rf <tmp dir> (mine was /usr/local/hadoop/tmp); then format the namenode: bin/hadoop namenode -format; then start all processes again: bin/start-all.sh. You may also consider rolling back using a checkpoint (if you had it enabled).
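The same steps as a single shell session, mirroring the answer's own commands. This is a sketch: the tmp path is the answerer's, so substitute your own hadoop.tmp.dir, and note that wiping it destroys all HDFS data:

    cd /usr/local/hadoop
    rm -Rf /usr/local/hadoop/tmp/*     # wipe the temporary dir (destroys HDFS data!)
    bin/hadoop namenode -format        # reformat the namenode
    bin/start-all.sh                   # restart all daemons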

Amazon s3a returns 400 Bad Request with Spark

This message corresponds to something like "bad endpoint" or an unsupported signature version. As seen here, Frankfurt is the only region that does not support Signature Version 2, and it's the one I picked. Of course, even after all my research I can't say exactly what a signature version is; it's not obvious in the documentation. But V2 seems … Read more
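For reference, the fix that usually follows from this diagnosis is to point s3a at the region's own endpoint and force Signature Version 4 in the AWS SDK. A Scala sketch follows; the fs.s3a.* keys and the enableV4 system property come from the Hadoop s3a and AWS SDK documentation, not from this answer, and the bucket and key are made up:

    import org.apache.spark.{SparkConf, SparkContext}

    // Signature Version 4 must be enabled in the AWS SDK before any s3a client is created.
    System.setProperty("com.amazonaws.services.s3.enableV4", "true")

    // Master is assumed to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("s3a-frankfurt"))
    val hc = sc.hadoopConfiguration
    hc.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com") // Frankfurt's regional endpoint
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    println(sc.textFile("s3a://my-bucket/some/key.txt").count()) // hypothetical bucket/key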