Monday, July 27, 2015

Import CSV File Into MySQL Table



If you want to import a CSV file (where the values are separated by ';') into a MySQL table, log in to MySQL and run the following command:


LOAD DATA INFILE '/path/to/file.csv' 
INTO TABLE table_name 
FIELDS TERMINATED BY ';' 
ENCLOSED BY '"' 
LINES TERMINATED BY '\n';


You may also need to make the CSV file readable by the MySQL server:

$ chmod 777 /path/to/file.csv
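
For example, a complete statement for a hypothetical users table and file might look like this (IGNORE 1 LINES skips a header row; drop it if your file starts directly with data):

LOAD DATA INFILE '/tmp/users.csv'
INTO TABLE users
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;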



Thursday, July 02, 2015

Apache Pig


Apache Pig is a platform for analyzing large data sets. 


Step 1. Download Apache Pig 0.15.0: point your browser to http://mirror.symnds.com/software/Apache/pig/pig-0.15.0/

Step 2. Copy the archive to HADOOP_HOME:

cd $HADOOP_HOME
mv ~/Downloads/pig-0.15.0.tar.gz $HADOOP_HOME

Step 3. Extract pig-0.15.0.tar.gz; you should see a pig-0.15.0 directory:

tar -xzvf pig-0.15.0.tar.gz

Step 4. Add PIG_HOME and update PATH by adding the following lines to ~/.bash_profile:

export PIG_HOME=$HADOOP_HOME/pig-0.15.0
export PATH=$PATH:$PIG_HOME/bin

Then reload the profile:

source ~/.bash_profile

Step 5. Make sure Pig is set up properly; you should see "Apache Pig version 0.15.0" on the console:

pig -version

Step 6. Now you can play with Pig.


The Pig execution environment has two modes:


  • Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.


  • Hadoop: Also called MapReduce mode, all scripts are run on a given Hadoop cluster.


Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:


  • Script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example, file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.


  • Grunt: Grunt is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will execute the command on your behalf. This is very useful for prototyping and “what if” scenarios.


  • Embedded: Pig programs can be executed as part of a Java program (see the sketch below).
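
As a rough illustration of embedded mode, here is a minimal sketch using the PigServer class (it assumes the Pig and Hadoop jars are on the classpath; the paths are placeholders):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {

    public static void main(String[] args) throws Exception {
        // local mode: runs on a single machine, no Hadoop cluster required
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("messages = LOAD '/tmp/app.log';");
        pig.registerQuery("out = FILTER messages BY $0 MATCHES '.*error.*';");
        // store() triggers the execution of the whole pipeline
        pig.store("out", "/tmp/errors");
    }
}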


Now that we have installed Apache Pig, we can play with it. I used it to filter large log files and extract the information I needed; in this case I was interested only in the errors. You can create a file with the extension .pig:

-- $input and $output are parameters, substituted when the script is run
messages = LOAD '$input';
-- keep only the lines that contain the word 'error' (MATCHES must match the whole line)
out = FILTER messages BY $0 MATCHES '.*error.*';
STORE out INTO '$output';
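
You can run the script and fill in the parameters from the command line, for example in local mode (the paths are placeholders):

pig -x local -param input=/var/log/messages -param output=/tmp/errors file.pig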


Here you can see a small list of operators that you can use in Pig:

Operator   Description
FILTER     Select a set of tuples from a relation based on a condition.
FOREACH    Iterate the tuples of a relation, generating a data transformation.
GROUP      Group the data in one or more relations.
JOIN       Join two or more relations (inner or outer join).
LOAD       Load data from the file system.
ORDER      Sort a relation based on one or more fields.
SPLIT      Partition a relation into two or more relations.
STORE      Store data in the file system.
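
As a small sketch combining several of these operators (the input file and its tab-separated level/message layout are made up for the example):

logs = LOAD '/tmp/app.log' USING PigStorage('\t') AS (level:chararray, msg:chararray);
grouped = GROUP logs BY level;
counts = FOREACH grouped GENERATE group AS level, COUNT(logs) AS total;
sorted = ORDER counts BY total DESC;
STORE sorted INTO '/tmp/level_counts';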

Wednesday, May 20, 2015

Apache Mahout: Clustering Evaluation


In this post I would like to share the steps that I followed to evaluate the quality of clusters. Feel free to add comments.

Mahout has some implementations for internal cluster evaluation. The goal is to calculate the intra-cluster and inter-cluster density once you have applied a clustering algorithm such as k-means.

As a reminder, here is a brief definition of these concepts:


Intra-cluster distance: the distance between members of the same cluster.

Inter-cluster distance: the distance between all pairs of centroids.

These measures can be used to judge whether the clustering results are good or not.

Intra-cluster distances should be small compared to inter-cluster distances. The target is to get clusters whose members are close to each other, while the centroids of different clusters are far apart.
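
To make the idea concrete, here is a toy sketch with made-up 2D points (plain Euclidean distance, just for illustration):

public class DistanceToy {

    // Euclidean distance between two 2D points
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        double[][] a = {{0, 0}, {1, 0}};   // cluster A, centroid (0.5, 0)
        double[][] b = {{10, 0}, {11, 0}}; // cluster B, centroid (10.5, 0)
        // intra-cluster distance: between members of the same cluster
        System.out.println("intra A = " + dist(a[0], a[1])); // 1.0
        System.out.println("intra B = " + dist(b[0], b[1])); // 1.0
        // inter-cluster distance: between the two centroids
        System.out.println("inter = " + dist(new double[]{0.5, 0}, new double[]{10.5, 0})); // 10.0
        // intra << inter, so these two clusters are well separated
    }
}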

Here you can see the code that I was using to test my results:


        // RepresentativePointsDriver and ClusterEvaluator live in
        // org.apache.mahout.clustering.evaluation
        Configuration conf = new Configuration();
        // directory with the output of the clustering job
        Path output = new Path("/clustering_output");
        DistanceMeasure measure = new CosineDistanceMeasure();
        int numIterations = 10;
        // final clusters produced by the algorithm (e.g. by k-means)
        Path clustersIn = new Path(output, "clusters-1-final");
        try {
            // compute the representative points of each cluster (true = run sequentially)
            RepresentativePointsDriver.run(conf, clustersIn,
                    new Path(output, "clusteredPoints"), output, measure,
                    numIterations, true);
            ClusterEvaluator evaluator = new ClusterEvaluator(conf, clustersIn);
            // average of each cluster's intra-cluster density
            System.out.println("Intra-cluster density = " + evaluator.intraClusterDensity());
            // inter-cluster density as defined in "Mahout In Action"
            System.out.println("Inter-cluster density = " + evaluator.interClusterDensity());
        } catch (IOException | InterruptedException | ClassNotFoundException e) {
            // don't swallow the exceptions silently
            e.printStackTrace();
        }


The book Mahout in Action provides a way to evaluate the quality of the clusters:


        /* MAHOUT IN ACTION */
        Configuration conf = new Configuration();
        DistanceMeasure measure = new CosineDistanceMeasure();
        // name of the directory with the final clusters, as produced by k-means
        String clusterFinal = "clusters-1-final";
        String inputFile = "clustering_output/" + clusterFinal + "/part-r-00000";
        Path path = new Path(inputFile);
        System.out.println("Input Path: " + path);
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        List<Cluster> clusters = new ArrayList<Cluster>();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Writable key = (Writable) reader.getKeyClass().newInstance();
            ClusterWritable value = (ClusterWritable) reader.getValueClass().newInstance();

            // collect every cluster stored in the sequence file
            while (reader.next(key, value)) {
                Cluster cluster = (Cluster) value.getValue();
                clusters.add(cluster);
                value = (ClusterWritable) reader.getValueClass().newInstance();
            }
            double max = 0;
            double min = Double.MAX_VALUE;
            double sum = 0;
            int count = 0;
            // measure the distance between every pair of centroids
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = measure.distance(clusters.get(i).getCenter(),
                            clusters.get(j).getCenter());
                    min = Math.min(d, min);
                    max = Math.max(d, max);
                    sum += d;
                    count++;
                }
            }
            System.out.println("Maximum Intercluster Distance: " + max);
            System.out.println("Minimum Intercluster Distance: " + min);
            // average pairwise distance, scaled between min and max
            double density = (sum / count - min) / (max - min);
            System.out.println("Scaled Inter-Cluster Distance: " + density);

        } catch (InstantiationException | IllegalAccessException e) {
            // newInstance() failed; don't hide the error
            e.printStackTrace();
        } finally {
            reader.close();
        }

Monday, May 11, 2015

MySQL: How do you mysqldump specific table(s)?


If you need to dump the tables t1 and t2 from the database "my_database", run the following command:


mysqldump -u "your user" -p my_database t1 t2 > dump_example.sql
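
The resulting dump can be loaded back later by feeding it to the mysql client:

mysql -u "your user" -p my_database < dump_example.sql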

Monday, January 12, 2015

GlassFish: Connection Pool Setup for PostgreSQL


In this post we are going to explain how to register a JDBC data source over JNDI so that it can be accessed from a web app. After a DataSource object is registered within JNDI, an application can use the JNDI API to access it and connect to the data source it represents.

You need to install PostgreSQL and GlassFish; in this example I installed:

glassfish 4.0
psql (PostgreSQL) 9.1.14



We will explain the required settings step-by-step:
    • Download the PostgreSQL JDBC driver and copy it into [glassfish_home]/glassfish/domains/domain1/lib/. In this example /opt/glassfish4/glassfish/domains/domain1/lib/
    • Open the administration panel of GlassFish (http://localhost:4848) and go to the tab Resources/JDBC/JDBC Connection Pools:
      • Create a new connection pool: select the resource type javax.sql.DataSource and the database vendor PostgreSQL, and click next.
      • As the datasource classname enter org.postgresql.ds.PGSimpleDataSource and add the following additional properties:
                      User
                      DatabaseName
                      Password
                      ServerName (127.0.0.1)
                      Url (something similar to this one: jdbc:postgresql://127.0.0.1:5432/preferences)
                      PortNumber (5432 is the default port)

      • You can test the connection using the "Ping" button.
    • In order to use this connection pool from Java applications, it needs a JNDI registration. Go to the tab Resources/JDBC/JDBC Resources and configure the new resource: set the JNDI name jdbc/[database_name] and select the connection pool created previously. This JNDI name is what applications use to access the PostgreSQL database, as shown in the sketch below.
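
Once the resource is registered, a web application can look it up through the standard JNDI API. A minimal sketch, assuming the JNDI name jdbc/preferences (matching the Url example above):

import java.sql.Connection;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public class PreferencesDao {

    public Connection openConnection() throws NamingException, SQLException {
        // look up the DataSource registered in GlassFish under its JNDI name
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("jdbc/preferences");
        // the connection is borrowed from the pool configured in the admin console
        return ds.getConnection();
    }
}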



Thursday, January 08, 2015

Importing data from a CSV file into a Postgres table


You have a CSV file called “data.csv” where the values of the different columns are delimited using “,”, and you want to import this data into Postgres, into a table called 'example'.

In postgres execute this command:

COPY example FROM '/path/to/data.csv' WITH DELIMITER ',' CSV HEADER;
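
The HEADER option tells Postgres to skip the first line of the file (the column names); omit it if your file starts directly with data. Also note that COPY reads the file on the server side, so it must be readable by the Postgres server process; if it is not, you can use psql's client-side \copy variant instead:

\copy example FROM '/path/to/data.csv' WITH DELIMITER ',' CSV HEADER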

If you have in the table the columns user_id, item_id and rating, and your CSV contains:

2,5,3.0
2,123,4.0

then your table will contain:

user_id   item_id   rating
2         5         3.0
2         123       4.0