Nutch Code

1. Crawl

bin/nutch crawl urls -dir crawl -depth 5 -topN 100 >& crawl.log

crawl urls

In bin/nutch, the crawl command is mapped to the class org.apache.nutch.crawl.Crawl:

if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawl

Crawl()

-dir crawl

The crawled documents are stored in the “crawl” folder.
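The command-to-class lookup above can be sketched as a small shell function. This is only a simplified illustration, not the real bin/nutch script: resolve_class is a hypothetical name, and treating any unknown command as a class name is an assumption on my part.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Nutch): map a command name to the
# Java class that should be launched for it.
resolve_class() {
  local command="$1"
  if [ "$command" = "crawl" ]; then
    # Same mapping as the excerpt from bin/nutch above
    echo "org.apache.nutch.crawl.Crawl"
  else
    # Assumption: anything else is already a fully qualified class name
    echo "$command"
  fi
}
```

For example, resolve_class crawl prints org.apache.nutch.crawl.Crawl, while a class name passed in directly is printed unchanged.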

Posted in Programming | Leave a comment

[Japanese] Immigration Service

http://www.nii.ac.jp/daigakuin/STAFF/shorui_e.html

Posted in Life | Leave a comment

[Printer] Fuji Xerox Linux

To install the Fuji Xerox printer driver on Linux (Ubuntu):

1. Download the driver from http://www.fujixerox.co.jp/download/apeosport/download/c4300series/linux/

2. Extract the RPM file

3. Copy the filters

pdftopdffx
pdftopjlfx
pstopdffx

from ~/fxlinuxprint-1.0.3-1.i386/usr/lib/cups/filter/

to /usr/lib/cups/filter/

4. Open System > Administration > Printing

5. Provide the PPD file

~/fxlinuxprint-1.0.3-1.i386/usr/share/cups/model/FujiXerox/en/fxlinuxprint.ppd
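Step 3 above can be scripted. The following is a minimal sketch, assuming the RPM has already been extracted; copy_fx_filters is a hypothetical helper name, and the two directories are passed in as arguments rather than hard-coded.

```shell
#!/usr/bin/env bash

# Hypothetical helper: copy the three Fuji Xerox CUPS filters from an
# extracted package tree into a CUPS filter directory.
# Usage: copy_fx_filters <extracted_dir> <cups_filter_dir>
copy_fx_filters() {
  local src="$1/usr/lib/cups/filter"
  local dst="$2"
  local f
  for f in pdftopdffx pdftopjlfx pstopdffx; do
    # Stop early if the extracted tree does not contain the filter
    if [ ! -f "$src/$f" ]; then
      echo "missing filter: $src/$f" >&2
      return 1
    fi
    cp "$src/$f" "$dst/" || return 1
  done
}
```

With the paths from this post, the call would be copy_fx_filters ~/fxlinuxprint-1.0.3-1.i386 /usr/lib/cups/filter, run from a root shell because the destination is a system directory.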

Posted in Computer Related | Leave a comment

[MathSearch] Math webpages

http://en.wikipedia.org/

http://arxiv.org/

http://mathworld.wolfram.com

Posted in Computer Related | Leave a comment

[Ubuntu] Nutch

Nutch

http://wiki.apache.org/nutch/NutchTutorial

1. Download Nutch

Download Nutch from here http://nutch.apache.org/

2. Extract

Extract the downloaded compressed file to a folder. For example, I use

/home/nqminh/nutch-1.2

3. Create “urls” file

In the nutch folder, create a file named urls. This file contains the URLs of the websites that we want to crawl. For example, to crawl Wikipedia, I put this URL into the “urls” file:

http://en.wikipedia.org

4. Name the crawler

Open the conf/nutch-default.xml file and find the property “http.agent.name”. In its <value> tag, insert any name. In my case, I use <value>nqminh Spider</value>

5. Edit file conf/crawl-urlfilter.txt

Find the rule under this comment:

# accept hosts in MY.DOMAIN.NAME

and replace “MY.DOMAIN.NAME” with “en.wikipedia.org”.

6. Run the crawl

bin/nutch crawl urls -dir crawl -depth 5 -topN 100 >& crawl.log

7. Search

For example, to search for the keyword “Project”:

bin/nutch org.apache.nutch.searcher.NutchBean Project

8. Copy nutch*.war to tomcat*/webapps/nutch/

This lets us use the web user interface. Extract the archive one time only. jar unpacks into the current directory, so run this from the webapps/nutch/ folder:

jar xvf /home/nqminh/apache-tomcat-7.0.12/webapps/nutch/nutch-1.2.war

9. Start Apache Tomcat for the user interface (using a web browser)

Go to the nutch folder and start Tomcat (use stop instead of start to shut it down):

/home/nqminh/apache-tomcat-7.0.12/bin/catalina.sh start

Visit http://localhost:8080/nutch/
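Steps 3 and 5 above (the “urls” file and the URL filter) can also be scripted. This is a minimal sketch rather than anything shipped with Nutch: prepare_crawl is a hypothetical helper, and it assumes the host is a plain hostname with no "/" or "&" characters, since it is pasted directly into a sed replacement.

```shell
#!/usr/bin/env bash

# Hypothetical helper: write the seed file and point the URL filter
# at one host.
# Usage: prepare_crawl <nutch_dir> <seed_url> <host>
prepare_crawl() {
  local nutch_dir="$1" seed_url="$2" host="$3"

  # Step 3: the "urls" file holds the seed URL
  printf '%s\n' "$seed_url" > "$nutch_dir/urls" || return 1

  # Step 5: swap the MY.DOMAIN.NAME placeholder for the real host.
  # -i.bak edits in place and keeps the original as crawl-urlfilter.txt.bak
  sed -i.bak "s/MY\.DOMAIN\.NAME/$host/g" "$nutch_dir/conf/crawl-urlfilter.txt"
}
```

With the values from this post, the call would be prepare_crawl /home/nqminh/nutch-1.2 http://en.wikipedia.org en.wikipedia.org.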

Posted in Programming | Leave a comment

[Ubuntu] JAVA_HOME

sudo bash -c "echo JAVA_HOME=/usr/lib/jvm/java-6-openjdk/jre/ >> /etc/environment"

sudo bash -c "echo NUTCH_JAVA_HOME=/usr/lib/jvm/java-6-openjdk/jre/ >> /etc/environment"
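Running the commands above twice appends the same lines twice. A small guard avoids that. This is a sketch under my own assumptions: append_once is a hypothetical helper name, and it only skips a line that already appears in the file exactly as written.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
# Usage: append_once <line> <file>
append_once() {
  local line="$1" file="$2"
  # grep flags: -q quiet, -x match the whole line, -F fixed string (no regex)
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For /etc/environment the function has to run in a root shell, for example append_once "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/jre/" /etc/environment.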

Posted in Programming | Leave a comment

World Cup 2010 Japan vs Paraguay

In the penalty shootout, Komano (3) was the only player who failed to convert his kick. The ball hit the crossbar and flew out.

Photo

Posted in Life | Leave a comment

Shinjuku Badminton

Shinjuku Ward badminton tournament.

Posted in Life | Leave a comment

World Cup 2010 Japan vs Denmark

Yasuhito Endo (7), Keisuke Honda (18) and Yoshito Okubo (16) celebrate a goal against Denmark in the World Cup 2010 Group E match at Royal Bafokeng Stadium in Rustenburg, South Africa, on 24/06/2010.

Photo

Posted in Life | Leave a comment

“Xin đểu” (Shakedown)

“Xin đểu” (Tuổi Trẻ Online)

Posted in Life | Leave a comment