Walid Megherbi authored f27aea54
node2vec on spark
This library is a Scala implementation of node2vec for Spark, as described in the paper:
node2vec: Scalable Feature Learning for Networks. Aditya Grover and Jure Leskovec. Knowledge Discovery and Data Mining, 2016.
The node2vec algorithm learns continuous representations for nodes in any (un)directed, (un)weighted graph. Please check the project page for more details.
Building node2vec_spark
In order to build node2vec_spark, use the following:
$ git clone https://github.com/Skarface-/node2vec.git
$ mvn clean package
The build requires:
Maven 3.0.5 or newer
Java 7+
Scala 2.10 or newer
This produces a jar file in "node2vec_spark/target/".
Examples
This library has two functions: randomwalk and embedding.
They are described in the papers "node2vec: Scalable Feature Learning for Networks" and "Efficient Estimation of Word Representations in Vector Space", respectively.
Random walk
Example:
./spark-submit --class com.navercorp.Main \
./node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar \
--cmd randomwalk --p 100.0 --q 100.0 --walkLength 40 \
--input <input> --output <output>
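The --p and --q values in the command above are the return and in-out hyperparameters from the node2vec paper. The following is a minimal sketch of how they bias a single step of the walk, written in plain Scala without Spark. The object and method names (WalkBias, alpha, transitionProbs) are hypothetical and are not taken from this library's source; only the bias rule itself (1/p, 1, 1/q by distance from the previous node) comes from the paper.

```scala
object WalkBias {
  // Unnormalized search bias from the node2vec paper.
  // distPrevToNext is the shortest-path distance (0, 1 or 2) between the
  // previously visited node and the candidate next node.
  def alpha(p: Double, q: Double, distPrevToNext: Int): Double =
    distPrevToNext match {
      case 0 => 1.0 / p // candidate is the previous node: step back
      case 1 => 1.0     // candidate is also a neighbor of the previous node
      case 2 => 1.0 / q // candidate moves away from the previous node
      case d =>
        throw new IllegalArgumentException(s"distance must be 0, 1 or 2, got $d")
    }

  // Normalized transition probabilities from the current node.
  // Each candidate is (edge weight, distance from the previous node).
  def transitionProbs(p: Double, q: Double,
                      candidates: Seq[(Double, Int)]): Seq[Double] = {
    val unnormalized = candidates.map { case (w, d) => w * alpha(p, q, d) }
    val total = unnormalized.sum
    unnormalized.map(_ / total)
  }
}
```

With p = q = 100.0 and three unit-weight candidates at distances 0, 1 and 2, the unnormalized weights are 0.01, 1.0 and 0.01, so the walk strongly prefers the candidate that is a common neighbor of the previous node.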
Options
Invoke a command without arguments to list available arguments and their default values:
--cmd COMMAND
Function to run: randomwalk or embedding. To run both sequentially, pass "node2vec". Default is "node2vec".
--input [INPUT]
Input edgelist path. The supported input format is an edgelist: "node1_id_int node2_id_int <weight_float, optional>"
--output [OUTPUT]
Output path for the generated random walks.
--walkLength WALK_LENGTH
Length of walk per source. Default is 80.
--numWalks NUM_WALKS
Number of walks per source. Default is 10.
--p P
Return hyperparameter. Default is 1.0.
--q Q
In-out hyperparameter. Default is 1.0.
--weighted Boolean
Whether the graph is weighted. Default is true.
--directed Boolean
Whether the graph is directed. Default is false.
--degree UPPER_BOUND_OF_NUMBER_OF_NEIGHBORS
Upper bound on the number of neighbors. Default is 30.
--indexed Boolean
Whether the nodes in the edgelist are already indexed. Default is true.
If "indexed" is set to false, node2vec_spark indexes the nodes in the input edgelist. For example:
unindexed edgelist:
node1 node2 1.0
node2 node7 1.0

indexed:
1 2 1.0
2 3 1.0

1 node1
2 node2
3 node7
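The indexing step in the example above can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the library's actual code: the names EdgeListIndexer, Edge and index are hypothetical, ids are assumed to be assigned in order of first appearance starting at 1 (which matches the example), and a missing weight is assumed to default to 1.0.

```scala
import scala.collection.mutable

object EdgeListIndexer {
  // One indexed edge: source id, destination id, weight.
  final case class Edge(src: Long, dst: Long, weight: Double)

  // Parses lines of the form "src dst [weight]" whose node names are arbitrary
  // strings, and returns the indexed edges plus the (id, name) mapping.
  // Assumption: ids follow first appearance and start at 1.
  def index(lines: Seq[String]): (Seq[Edge], Seq[(Long, String)]) = {
    val ids = mutable.LinkedHashMap.empty[String, Long]
    def idOf(name: String): Long = ids.getOrElseUpdate(name, ids.size + 1L)

    // toVector forces strict evaluation so all ids exist before the mapping is read.
    val edges = lines.toVector.map(_.trim).filter(_.nonEmpty).map { line =>
      val parts = line.split("\\s+")
      require(parts.length == 2 || parts.length == 3, s"malformed edge: $line")
      val src = idOf(parts(0))
      val dst = idOf(parts(1))
      // Assumption: an omitted weight is treated as 1.0.
      val weight = if (parts.length == 3) parts(2).toDouble else 1.0
      Edge(src, dst, weight)
    }
    (edges, ids.toSeq.map { case (name, id) => (id, name) })
  }
}
```

Calling EdgeListIndexer.index(Seq("node1 node2 1.0", "node2 node7 1.0")) yields the edges (1, 2, 1.0) and (2, 3, 1.0) together with the mapping 1 node1, 2 node2, 3 node7.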
Input
The supported input format is an edgelist: