Date: 2013/10/13
OS: Ubuntu 12.04 LTS
JDK: 1.7.0_21
Nutch: 2.2.1
MySQL: 5.5.32
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Pre1: Install and configure the Oracle JDK
Pre2: Install and configure MySQL: sudo apt-get install mysql-server mysql-client
Pre3: Install and configure Apache Ant: sudo apt-get install ant
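Before moving on, it can help to confirm that the three prerequisites are actually on the PATH. The following is only a minimal sketch: the helper names require_cmd and check_prereqs are hypothetical and not part of the JDK, Ant, or MySQL.

```shell
# require_cmd NAME (hypothetical helper): report whether NAME is on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# check_prereqs (hypothetical helper): check the tools this guide relies on.
# Returns non-zero if any of them is missing.
check_prereqs() {
    status=0
    for cmd in java ant mysql; do
        require_cmd "$cmd" || status=1
    done
    return $status
}
```

Running check_prereqs after Pre1 to Pre3 should print one "found" line per tool; any "missing" line points at the step that needs repeating.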
Start: Setting up Nutch 2.2.1 on Ubuntu with MySQL as the database and UTF-8 as the default encoding
Step1: MySQL configuration
First, edit /etc/mysql/my.cnf and add the following under [mysqld]:
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M
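A typo in my.cnf is easy to miss, so a quick check of the file before restarting MySQL can save time. This is a rough sketch under stated assumptions: missing_mysqld_settings is a hypothetical helper, and it only looks for the key names at the start of a line; it does not verify that they sit under [mysqld] or that the values are valid.

```shell
# missing_mysqld_settings FILE (hypothetical helper): print every expected
# setting name that has no "name=" line in FILE. Prints nothing when all
# six settings from this step are present.
missing_mysqld_settings() {
    cnf="$1"
    for key in innodb_file_format innodb_file_per_table innodb_large_prefix \
               character-set-server collation-server max_allowed_packet; do
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$cnf" || echo "$key"
    done
}
```

For example, missing_mysqld_settings /etc/mysql/my.cnf should print nothing once the edit is complete. After restarting the server (on Ubuntu 12.04, for instance, with sudo service mysql restart), the statement SHOW VARIABLES LIKE 'character_set_server'; in the mysql client shows the value the server actually picked up.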
Then create the database and the table:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
USE nutch;
CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8;
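The varchar(767) primary key is the reason the large-prefix settings from Step1 matter. As general MySQL 5.5 background (not stated in the original post): InnoDB limits an index key prefix to 767 bytes by default, and to 3072 bytes when innodb_large_prefix is enabled together with the Barracuda file format and a DYNAMIC or COMPRESSED row format. The arithmetic can be sketched as follows; key_bytes and fits_prefix are hypothetical helper names.

```shell
# key_bytes CHARS BYTES_PER_CHAR (hypothetical helper):
# worst-case byte length of an indexed varchar(CHARS) column.
key_bytes() {
    echo $(( $1 * $2 ))
}

# fits_prefix CHARS BYTES_PER_CHAR LIMIT (hypothetical helper):
# succeeds when the worst-case key length is within LIMIT bytes.
fits_prefix() {
    [ "$(key_bytes "$1" "$2")" -le "$3" ]
}
```

key_bytes 767 3 gives 2301 (utf8, up to 3 bytes per character) and key_bytes 767 4 gives 3068 (utf8mb4, up to 4 bytes per character). Both exceed the default 767-byte limit but fit within 3072 bytes, so fits_prefix 767 4 3072 succeeds while fits_prefix 767 3 767 fails.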
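Step2: Nutch configuration
Download Nutch 2.2.1 from the official site, http://www.apache.org/dyn/closer.cgi/nutch/, and extract it to a local installation directory. The root of that directory is referred to below as ${APACHE_NUTCH_HOME}.
Uncomment the following line:
default"/>
Modify the following line: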
Edit ${APACHE_NUTCH_HOME}/conf/gora.properties: comment out the default database connection settings and add the following:
###############################
# MySQL configure             #
###############################
# Replace xxxx with your MySQL username and password.
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
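Since Nutch will connect with exactly the values in gora.properties, it can be worth reading them back and trying the same login with the mysql client before building. The sketch below is deliberately simplified: gora_prop is a hypothetical helper that handles only plain key=value lines and skips commented-out ones; it is not a full Java properties parser.

```shell
# gora_prop FILE KEY (hypothetical helper): print the value of KEY from a
# properties file. Lines starting with '#' never match because the pattern
# is anchored to the key at the start of the line. If KEY appears more than
# once, the last occurrence wins.
gora_prop() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}
```

A possible use, run from ${APACHE_NUTCH_HOME}: mysql -u "$(gora_prop conf/gora.properties gora.sqlstore.jdbc.user)" -p -e 'SHOW TABLES' nutch. If that login fails, the crawl would most likely fail for the same reason.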
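Edit ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml. It is best to make this change in line with the auto-generated table approach described earlier. Sources online say the primarykey length has to be changed from 512 to 767, that is: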
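Change to:
Step5: nutch-site.xml configuration
Add the following configuration:
Step6: Build
The ant commands themselves are not explained here. Switch to ${APACHE_NUTCH_HOME} and run ant clean, then ant. When the build finishes, a runtime folder is generated under ${APACHE_NUTCH_HOME}.
Step7: Web crawling and seed configuration
Create the seed file: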
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://www.sina.com.cn' > urls/seed.txt
echo 'http://www.ifeng.com' >> urls/seed.txt
bin/nutch crawl urls -depth 5 -topN 10
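Because > truncates a file while >> appends to it, writing all seeds in a single command avoids accidentally keeping only the last URL. A minimal sketch, where write_seeds is a hypothetical helper and not part of Nutch:

```shell
# write_seeds FILE URL... (hypothetical helper): create FILE's directory if
# needed, then write each URL on its own line, replacing any previous content.
write_seeds() {
    file="$1"
    shift
    mkdir -p "$(dirname "$file")"
    printf '%s\n' "$@" > "$file"
}
```

For example, write_seeds urls/seed.txt http://www.sina.com.cn http://www.ifeng.com produces a two-line seed file that can then be passed to the crawl command via the urls directory.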
This completes the basic configuration.