Running Hadoop on Ubuntu in Pseudo-Distributed Mode

I ran Hadoop on Ubuntu in pseudo-distributed mode today. The detailed steps follow:

Install Ubuntu 11.10 i386 in VirtualBox. In this release, the JDK is located in /usr/lib/jvm/java-6-openjdk by default.

Add a dedicated Hadoop user account for running Hadoop

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop

Configure SSH for the Hadoop user

sudo apt-get install ssh
su - hadoop
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
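Because >> appends on every run, repeating the last command adds a duplicate entry to authorized_keys each time. A minimal sketch of an idempotent variant follows. The helper name authorize_key is my own invention, and the chmod is there because sshd may ignore key files that other users can write to.

```shell
# Append a public key to an authorized_keys file only if it is not already there.
# authorize_key is a hypothetical helper name, not part of OpenSSH.
authorize_key() {
  pub=$1
  auth=$2
  mkdir -p "$(dirname "$auth")"
  touch "$auth"
  # -x: match the whole line, -F: treat the key as a fixed string
  grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
  chmod 600 "$auth"
}

# Usage as the hadoop user:
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh localhost   # should log in without asking for a password
```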

Download the latest stable release of Hadoop from Hadoop’s homepage. I downloaded release 1.0.2 as a gzipped tar file (hadoop-1.0.2.tar.gz). Then uncompress hadoop-1.0.2.tar.gz:

tar zxvf hadoop-1.0.2.tar.gz
mv hadoop-1.0.2 hadoop

Configure Hadoop

The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop.

hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
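A wrong JAVA_HOME is a common reason for the Hadoop scripts to fail at startup, so it can be worth checking the path before going further. The following is only a sketch; check_java_home is a hypothetical helper name, and the only thing it tests is that bin/java exists and is executable under the given directory.

```shell
# Return success only if the given directory looks like a usable JAVA_HOME.
# check_java_home is a hypothetical helper name.
check_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Usage:
#   check_java_home /usr/lib/jvm/java-6-openjdk || echo "JAVA_HOME looks wrong" >&2
```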

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
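To double-check what a site file actually sets, a small helper can print the value that follows a given property name. This is a rough sketch, not a real XML parser: it assumes the one-tag-per-line layout used in the files above, and get_prop is a hypothetical name.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# Assumes one tag per line, as in the files above; get_prop is a hypothetical helper.
get_prop() {
  file=$1
  name=$2
  awk -v n="<name>$name</name>" '
    index($0, n) { found = 1; next }      # remember that the name line was seen
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")   # strip the surrounding tags
      print
      exit
    }
  ' "$file"
}

# Usage from the Hadoop directory:
#   get_prop conf/core-site.xml fs.default.name
```

If the property is not present in the file, the helper prints nothing.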

Format the HDFS filesystem

bin/hadoop namenode -format

Start your single-node cluster

bin/start-all.sh
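In Hadoop 1.x, start-all.sh should bring up five daemons on a pseudo-distributed node: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The JDK's jps tool lists running Java processes, so its output can be checked for all five. A minimal sketch, with missing_daemons as a hypothetical helper name:

```shell
# Read jps output on stdin and print the name of every expected daemon that is absent.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
  out=$(cat)
  for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # -w keeps "NameNode" from matching inside "SecondaryNameNode"
    printf '%s\n' "$out" | grep -qw "$d" || echo "$d"
  done
}

# Usage:
#   jps | missing_daemons   # no output means all five daemons are running
```

If a daemon is reported missing, its log file under the logs directory is usually the first place to look.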

Run the WordCount example job

bin/hadoop fs -copyFromLocal /home/hadoop/test_wc.txt test_wc.txt
bin/hadoop fs -ls
bin/hadoop jar hadoop-examples-1.0.2.jar wordcount test_wc.txt test_wc-output
bin/hadoop fs -cat test_wc-output/part-r-00000
bin/hadoop fs -copyToLocal test_wc-output /home/hadoop/test_wc-output
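For a small input file, the job's output can be sanity-checked against a local count. The sketch below assumes the bundled wordcount example splits words on whitespace only and writes word, tab, count on each line; local_wordcount is a hypothetical helper name, and the sort order may not match Hadoop's exactly for non-ASCII text.

```shell
# Count whitespace-separated words in a local file and print "word<TAB>count".
# local_wordcount is a hypothetical helper name.
local_wordcount() {
  tr -s '[:space:]' '\n' < "$1" |   # one word per line
    grep -v '^$' |                  # drop the empty line left by leading whitespace
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $2 "\t" $1 }'
}

# Usage:
#   local_wordcount /home/hadoop/test_wc.txt > /tmp/expected.txt
#   bin/hadoop fs -cat test_wc-output/part-r-00000 | diff - /tmp/expected.txt
```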

Stop your single-node cluster

bin/stop-all.sh

References:

Hadoop: The Definitive Guide

Running Hadoop On Ubuntu Linux (Single-Node Cluster)

Creating an 11.2.0.2 RAC Logical Standby Database

Yesterday I created an 11.2.0.2 RAC logical standby database for an 11.2.0.2 RAC primary database. Here are the steps.

You create a logical standby database by first creating a physical standby database and then transitioning it to a logical standby database. So first I created a physical standby database “belmont” for primary database “gilroy”:

SQL> SELECT db_unique_name,name,open_mode,database_role FROM v$database;
DB_UNIQUE_NAME                 NAME      OPEN_MODE            DATABASE_ROLE
------------------------------ --------- -------------------- ----------------
belmont                        GILROY    MOUNTED              PHYSICAL STANDBY

Before converting it to a logical standby, you need to stop Redo Apply on the physical standby database:

SQL> ALTER DATABASE recover managed standby DATABASE cancel;

A LogMiner dictionary must be built into the redo data so that the LogMiner component of SQL Apply can properly interpret changes it sees in the redo. To build the LogMiner dictionary, issue the following statement on the primary database:

SQL> EXECUTE DBMS_LOGSTDBY.BUILD;

If your standby is a RAC (as in my case), all auxiliary instances have to be shut down and the cluster disabled on the target standby. Shut down every instance except the one on which the MRP was running (your actual target instance). Once they are all down, disable the cluster and bounce the standby:

SQL> ALTER system SET cluster_database=FALSE scope=spfile;
SQL> shutdown immediate;
SQL> startup mount exclusive;

Now you are ready to tell MRP that it needs to continue applying redo data to the physical standby database until it is ready to convert to a logical standby database:

SQL> ALTER DATABASE recover TO logical standby fremont;

In the above statement, the database name of the standby is changed to “fremont” so that it can become a logical standby database. Data Guard changes the database name (DB_NAME) and sets a new database identifier (DBID) for the logical standby.
At this point, you can re-enable the cluster database parameter if the standby is a RAC, and then restart and open the new logical standby database:

SQL> ALTER system SET cluster_database=TRUE scope=spfile;
SQL> shutdown;
SQL> startup mount;
SQL> ALTER DATABASE OPEN resetlogs;

Issue the following statement to start SQL Apply in real-time apply mode using the IMMEDIATE keyword:

SQL> ALTER DATABASE START logical standby apply immediate;

Disabling SELinux and iptables in RHEL 6

The Setup Agent in RHEL 6 offers no way to disable SELinux and iptables (the firewall), so you need to do it manually as root.

Disabling SELinux

Modify the /etc/selinux/config file to disable SELinux (the change takes effect after a reboot):

SELINUX=disabled
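The same edit can be scripted instead of done in an editor. This is a minimal sketch that assumes GNU sed (as shipped with RHEL) and must run as root; disable_selinux_config is a hypothetical helper name, and it keeps a .bak copy of the original file.

```shell
# Set SELINUX=disabled in an SELinux config file, keeping a .bak copy.
# disable_selinux_config is a hypothetical helper name; run it as root.
disable_selinux_config() {
  cfg=${1:-/etc/selinux/config}
  # Only the SELINUX= line is touched; SELINUXTYPE= does not match this pattern.
  sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"
}

# Usage on the RHEL 6 host:
#   disable_selinux_config
#   getenforce   # reports the current mode; "Disabled" only after a reboot
```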

Disabling iptables

Invoke the GUI tool “system-config-firewall” to disable the firewall.

The Oracle Storage Model

Data is stored logically in segments (typically tables) and physically in datafiles. The tablespace entity abstracts the two: one tablespace can contain many segments and be made up of many datafiles. There is no direct relationship between a segment and a datafile. The datafiles can exist as files in a file system or (from release 10g onward) on ASM devices.

The separation of logical from physical storage is a necessary part of the relational database paradigm. The relational paradigm states that programmers should address only logical structures and let the database manage the mapping to physical structures. Any segment can exist in only one tablespace, but the tablespace can spread it across all the files making up the tablespace. This means that the tables’ sizes are not subject to any limitations imposed by the environment on maximum file size. As many segments can share a single tablespace, it becomes possible to have far more segments than there are datafiles.

The following figure shows the Oracle storage model sketched as an ER diagram, with the logical structures to the left and the physical structures to the right.

[Figure: The Oracle Storage Model]

There is one relationship drawn in as a dotted line: a many-to-many relationship between segments and datafiles. This relationship is dotted, because it shouldn’t be there. As good relational engineers, DBAs do not permit many-to-many relationships. Resolving this relationship into a normalized structure is what the storage model is all about.

The tablespace entity resolves the many-to-many relationship between segments and datafiles. One tablespace can contain many segments and be made up of many datafiles. This means that any one segment may be spread across multiple datafiles, and any one datafile may contain all or part of many segments.

The segment entity represents any database object that stores data and therefore requires space in a tablespace. Any segment can exist in only one tablespace, but the tablespace can spread it across all of its datafiles. This means that a table’s size is not subject to any limitations imposed by the environment on maximum file size. As many segments can share a single tablespace, it becomes possible to have far more segments than there are datafiles.

The Oracle block is the basic unit of I/O for the database. Datafiles are formatted into Oracle blocks, which are consecutively numbered. The size of Oracle blocks is fixed for a tablespace (generally speaking, it is the same for all tablespaces in the database); the default (with release 11g) is 8KB. A row might be only a couple of hundred bytes, and so there could be many rows stored in one block, but when a session wants a row, the whole block will be read from disk into the database buffer cache.
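As a rough back-of-the-envelope illustration of "many rows in one block", the arithmetic can be sketched as follows. The 200-byte row length is an assumed figure, and the calculation ignores block header overhead and free space reserved for updates, so a real block holds somewhat fewer rows.

```shell
block_size=8192   # default block size in bytes (8KB)
row_size=200      # assumed average row length in bytes, for illustration only

# Integer division: an upper bound on the number of whole rows per block
rows_per_block=$(( block_size / row_size ))
echo "$rows_per_block"
```

Reading any one of those rows still costs a full 8KB block read into the buffer cache.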

Pre 11.2 Databases Can No Longer Register With 11.2.0.3 Oracle Restart

I set up an 11.2.0.3 Oracle Restart environment yesterday and created 10.2.0.4 and 11.1.0.7 databases. When I tried to register these two databases with 11.2.0.3 Oracle Restart using SRVCTL (the one in 11.2.0.3 GI home), it failed as follows:

oracle@bbupg:/home/oracle-> which srvctl
/u03/app/11.2.0/grid/bin/srvctl
oracle@bbupg:/home/oracle-> srvctl add database -d dbu10204 -o /u01/app/oracle/product/10.2.0/dbhome_1 -p +DATA/dbu10204/spfiledbu10204.ora -a "DATA,LOG"
PRCD-1245 : Addition of database version 11.2.0.3.0 is not allowed using srvctl version 10.2.0.0.0

This used to work with 11.2.0.2 Oracle Restart.

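Sales and Playing Tennis

In an interview a few days ago, M asked me a question: do you think it is easier to sell our products through resellers, or to sell them directly to end users? I could not come up with an answer on the spot, so M offered this analogy: selling directly to end users is like playing tennis with the racket in your own hand, while selling through resellers is like being a coach who guides someone else holding the racket. The latter is clearly harder; being able to play and teaching someone else to play are entirely different levels of difficulty. I could not agree more.

However, when I brought the question up over dinner with C that evening, she had her own view. She believes that a company that wants to grow big must ultimately rely on resellers, using the customer base they hold to expand sales, and that handling every sale personally and dealing directly with end customers is unrealistic. A company needs different sales strategies at different stages. Her view sounds very reasonable too.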

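Finally, a few side notes: the key points M and L raised during the interview.

0. Be clear about the positioning and responsibilities of PSC, and be able to communicate them clearly to customers so that they understand what you will and will not do. This is also a form of self-protection.

1. Be professional, and build your own reputation.

2. PSC is a service organization. Thinking you are great counts for nothing; you are only great when the customer says so, and all of that feedback reaches L.

3. Nobody will watch over you and tell you what to do. You need very strong self-motivation, and at critical moments what you have accumulated day to day will show.

4. Besides mastering DB, pick one DBO as a second specialty. Explore the remaining products broadly, and do not limit yourself to Tech, OFM, or Apps.

5. Ideally, learn some industry knowledge, such as Telecom or FSI.

6. M requires PSC to understand the products at a depth second only to product R&D.

7. Speak up about the work you have completed, especially when you are at the customer site.

8. Customers often have experts in every field, frequently better than O's own employees, so think about what value PSC can deliver to the customer in that situation.

9. Paperwork is also very important, and it is a weak spot for colleagues who come from R&D.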

2012 Wish List

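0. Keep waiting

1. CANON EOS 600D Kit + CANON EF 50mm f/1.4

2. Pass the 11g OCP upgrade exam 1Z0-050

3. Travel; candidate destinations include Beijing, Hainan, Xi'an, Xiamen, Lijiang, Rio de Janeiro, Menlo Park

4. Learn a new programming language; candidates include Haskell, Lua, Ruby

5. Study a new relational database management system in depth; candidates include Sybase, MySQL, SQL Server

6. Study a Big Data solution in depth; candidates include Hadoop

7. Learn Django

8. Brush up on French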

Enabling Root Login via SSH in Solaris 10

After a fresh installation of Solaris 10, root login via SSH is disabled by default. It can be enabled as follows:

1. Modify /etc/ssh/sshd_config, set “PermitRootLogin” to yes;

2. Restart the SSH service:

# svcadm restart svc:/network/ssh:default
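Step 1 can also be scripted. The sketch below assumes the file already contains an uncommented PermitRootLogin line, and it avoids sed -i because the stock Solaris sed may not support it. enable_root_login is a hypothetical helper name; run it as root and then restart the SSH service as shown above.

```shell
# Rewrite the PermitRootLogin line of an sshd_config file to "yes".
# enable_root_login is a hypothetical helper name.
enable_root_login() {
  cfg=${1:-/etc/ssh/sshd_config}
  tmp="$cfg.new.$$"
  # Write to a temporary file, then copy back so the original file's
  # ownership and permissions are preserved.
  sed 's/^PermitRootLogin.*/PermitRootLogin yes/' "$cfg" > "$tmp" &&
    cp "$tmp" "$cfg" &&
    rm -f "$tmp"
}

# Usage:
#   enable_root_login
#   grep '^PermitRootLogin' /etc/ssh/sshd_config
```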