Spark Scala list folders in directory

We are using Hadoop 1.4, which doesn’t have the listFiles method, so we use listStatus to get the directories. It doesn’t have a recursive option, but a recursive lookup is easy to manage.

    val fs = FileSystem.get(new Configuration())
    val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
    status.foreach(x => println(x.getPath))
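Since the answer says a recursive lookup is easy to manage, here is a minimal sketch of one way to do it with listStatus alone (the helper name and the starting path are placeholders, not from the original answer):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: walk a directory tree using only listStatus, which is available
    // on old Hadoop versions that lack listFiles(path, recursive = true).
    def listRecursively(fs: FileSystem, path: Path): Seq[Path] =
      fs.listStatus(path).toSeq.flatMap { status =>
        if (status.isDir) listRecursively(fs, status.getPath)  // isDirectory on newer Hadoop
        else Seq(status.getPath)
      }

    val fs = FileSystem.get(new Configuration())
    listRecursively(fs, new Path("/user/someone/data")).foreach(println)  // placeholder path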

Default Namenode port of HDFS is 50070, but I have come across 8020 or 9000 at some places [closed]

The default Hadoop ports are as follows (HTTP ports, i.e. the ones with a web UI):

    Daemon                    Default Port   Configuration Parameter
    Namenode                  50070          dfs.http.address
    Datanodes                 50075          dfs.datanode.http.address
    Secondarynamenode         50090          dfs.secondary.http.address
    Backup/Checkpoint node?   50105          dfs.backup.http.address
    Jobtracker                50030          mapred.job.tracker.http.address
    Tasktrackers              50060          mapred.task.tracker.http.address

Internally, Hadoop mostly uses Hadoop IPC (Inter-Process Communication) to communicate amongst … Read more
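To illustrate where these parameter names come from, here is a minimal sketch of reading one of them from a Hadoop Configuration (assuming the standard *-site.xml files are on the classpath; the fallback value here is just the documented default):

    import org.apache.hadoop.conf.Configuration

    // Sketch: look up the Namenode HTTP address from the loaded Hadoop
    // configuration; if it is not set, fall back to the documented default.
    val conf = new Configuration()
    val namenodeHttpAddress = conf.get("dfs.http.address", "0.0.0.0:50070")
    println(s"Namenode web UI is served at $namenodeHttpAddress")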

Spark iterate HDFS directory

You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true), and with Spark:

    FileSystem.get(sc.hadoopConfiguration).listFiles(…, true)

Edit: It’s worth noting that good practice is to get the FileSystem that is associated with the Path‘s scheme:

    path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
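As a minimal sketch of using this from Spark (assumes an existing SparkContext sc; the HDFS path is a placeholder): listFiles returns a Hadoop RemoteIterator rather than a Scala Iterator, so it is drained manually.

    import org.apache.hadoop.fs.Path
    import scala.collection.mutable.ArrayBuffer

    // Sketch: recursively list every file under a placeholder HDFS path
    // and collect the file paths into a Scala collection.
    val path = new Path("hdfs:///some/dir")
    val it = path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
    val files = ArrayBuffer.empty[String]
    while (it.hasNext) {
      files += it.next().getPath.toString
    }
    files.foreach(println)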

Parallel Algorithms for Generating Prime Numbers (possibly using Hadoop’s map reduce)

Here’s an algorithm that is built on mapping and reducing (folding). It expresses the sieve of Eratosthenes

    P = {3, 5, 7, …} \ ⋃ { {p², p² + 2p, p² + 4p, …} | p ∈ P }

for the odd primes (i.e. without the 2). The folding tree deepens indefinitely to the right, like this: [folding-tree diagram omitted] where each prime number … Read more
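To make that set expression concrete, here is a minimal, bounded sketch of the same formula in Scala (the limit and the function name are placeholders; the original answer describes an unbounded, lazily folding version, not this eager one):

    // Sketch: odd primes up to a limit, computed as
    //   {3, 5, 7, ...} minus the union of {p*p, p*p + 2p, p*p + 4p, ...}
    // following the set expression quoted above (bounded, not lazy).
    def oddPrimesUpTo(limit: Int): Seq[Int] = {
      val odds = 3 to limit by 2
      val composites = odds
        .takeWhile(p => p.toLong * p <= limit)       // only generators whose squares fit
        .flatMap(p => (p * p) to limit by (2 * p))   // p*p, p*p + 2p, p*p + 4p, ...
        .toSet
      odds.filterNot(composites)
    }

    // e.g. oddPrimesUpTo(50) == Seq(3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)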

Easiest way to install Python dependencies on Spark executor nodes?

Having actually tried it, I think the link I posted as a comment doesn’t do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip for installing dependencies. It blows my mind that this isn’t supported better in Spark. … Read more

How can I force Spark to execute code?

Short answer: To force Spark to execute a transformation, you’ll need to require a result. Sometimes a simple count action is sufficient.

TL;DR: OK, let’s review the RDD operations. RDDs support two types of operations:

    transformations – which create a new dataset from an existing one
    actions – which return a value to the driver … Read more
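A minimal sketch of the point being made (assumes an existing SparkContext sc; the data is a placeholder): the map is a transformation and runs nothing by itself, while count is an action that forces the job to execute.

    // Sketch: transformations are lazy; an action triggers execution.
    val numbers = sc.parallelize(1 to 1000)
    val squared = numbers.map { n =>
      n * n  // this closure only runs once an action is invoked
    }
    val forced = squared.count()  // action: triggers the job on the executors
    println(s"Computed $forced squared values")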