怎么使用MapReduce-創(chuàng)新互聯(lián)

這篇文章給大家分享的是有關(guān)怎么使用MapReduce的內(nèi)容。小編覺得挺實用的，因此分享給大家做個參考，一起跟隨小編過來看看吧。

為太倉等地區(qū)用戶提供了全套網(wǎng)頁設(shè)計制作服務(wù)，及太倉網(wǎng)站建設(shè)行業(yè)解決方案。主營業(yè)務(wù)為成都網(wǎng)站制作、做網(wǎng)站、太倉網(wǎng)站設(shè)計，以傳統(tǒng)方式定制建設(shè)網(wǎng)站，并提供域名空間備案等一條龍服務(wù)，秉承以專業(yè)、用心的態(tài)度為用戶提供真誠的服務(wù)。我們深信只要達(dá)到每一位用戶的要求，就會得到認(rèn)可，從而選擇與我們長期合作。這樣，我們也可以走得更遠(yuǎn)！

大數(shù)據(jù)是使用工具和技術(shù)處理大量和復(fù)雜數(shù)據(jù)集合的術(shù)語。能夠處理大量數(shù)據(jù)的技術(shù)稱為MapReduce。怎么使用MapReduce

何時使用MapReduce

MapReduce特別適合涉及大量數(shù)據(jù)的問題。它通過將工作分成更小的塊，然后可以被多個系統(tǒng)處理。由于MapReduce將一個問題分片并行工作，與傳統(tǒng)系統(tǒng)相比，解決方案會更快。

大概有如下場景會應(yīng)用到MapReduce：

1 計數(shù)和統(tǒng)計

2 整理

3 過濾

4 排序

Apache Hadoop

在本文中，我們將使用Apache Hadoop。

開發(fā)MapReduce解決方案，推薦使用Hadoop，它已經(jīng)是事實上的標(biāo)準(zhǔn)，同時也是開源免費的軟件。

另外在Amazon，Google和Microsoft等云提供商租用或搭建Hadoop集群。

還有其他多個優(yōu)點：

可擴展：可以輕松清加新的處理節(jié)點，而無需更改一行代碼

成本效益：不需要任何專門和奇特的硬件，因為軟件在正常的硬件都運行正常

靈活：無模式?？梢蕴幚砣魏螖?shù)據(jù)結(jié)構(gòu) ，甚至可以組合多個數(shù)據(jù)源，而不會有很多問題。

容錯：如果有節(jié)點出現(xiàn)問題，其它節(jié)點可以接收它的工作，整個集群繼續(xù)處理。

另外，Hadoop容器還是支持一種稱為“流”的應(yīng)用程序，它為用戶提供了選擇用于開發(fā)映射器和還原器腳本語言的自由度。

本文中我們將使用PHP做為主開發(fā)語言。

怎么使用MapReduce

Hadoop安裝

Apache Hadoop的安裝配置超出了本文范圍。您可以根據(jù)自己的平臺，在線輕松找到很多文章。為了保持簡單，我們只討論大數(shù)據(jù)相關(guān)的事。

映射器（Mapper）

映射器的任務(wù)是將輸入轉(zhuǎn)換成一系列的鍵值對。比如在字計數(shù)器的情況下，輸入是一系列的行。我們按單詞將它們分開，把它們變成鍵值對（如key:word,value:1）,看起來像這樣：

the 1

water 1

on 1

water 1

on 1

... 1

然后，這些對然后被發(fā)送到reducer以進(jìn)行下一步驟。

reducer

reducer的任務(wù)是檢索（排序）對，迭代并轉(zhuǎn)換為所需輸出。在單詞計數(shù)器的例子中，取單詞數(shù)（值），并將它們相加得到一個單詞（鍵）及其最終計數(shù)。如下：

water 2

the 1

on 3

mapping和reducing的整個過程看起來有點像這樣，請看下列之圖表：

我們將從MapReduce世界的“Hello World”的例子開始，那就是一個簡單的單詞計數(shù)器的實現(xiàn)。我們將需要一些數(shù)據(jù)來處理。我們用已經(jīng)公開的書Moby Dick來做實驗。

怎么使用MapReduce

執(zhí)行以下命令下載這本書：

wget http://www.gutenberg.org/cache ... 1.txt

在HDFS（Hadoop分布式文件系統(tǒng)）中創(chuàng)建一個工作目錄

hadoop dfs -mkdir wordcount

我們的PHP代碼從mapper開始

#!/usr/bin/php<?php    // iterate through lines
    while($line = fgets(STDIN)){
        // remove leading and trailing
        $line = ltrim($line);
        $line = rtrim($line);
        // split the line in words
        $words = preg_split('/\s/', $line, -1, PREG_SPLIT_NO_EMPTY);
        // iterate through words
        foreach( $words as $key ) {
            // print word (key) to standard output
            // the output will be used in the
            // reduce (reducer.php) step
            // word (key) tab-delimited wordcount (1)
            printf("%s\t%d\n", $key, 1);
        }
    }?>

下面是 reducer 代碼。

#!/usr/bin/php<?php
    $last_key = NULL;
    $running_total = 0;
    // iterate through lines
    while($line = fgets(STDIN)) {
        // remove leading and trailing
        $line = ltrim($line);
        $line = rtrim($line);
        // split line into key and count
        list($key,$count) = explode("\t", $line);
        // this if else structure works because
        // hadoop sorts the mapper output by it keys
        // before sending it to the reducer
        // if the last key retrieved is the same
        // as the current key that have been received
        if ($last_key === $key) {
            // increase running total of the key
            $running_total += $count;
        } else {
            if ($last_key != NULL)
                // output previous key and its running total
                printf("%s\t%d\n", $last_key, $running_total);
            // reset last key and running total
            // by assigning the new key and its value
            $last_key = $key;
            $running_total = $count;
        }
    }?>

你可以通過使用某些命令和管道的組合來在本地輕松測試腳本。

head -n1000 pg2701.txt | ./mapper.php | sort | ./reducer.php

我們在Apache Hadoop集群上運行它：

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \ -mapper "./mapper.php"
 -reducer "./reducer.php"
 -input "hello/mobydick.txt"
 -output "hello/result"

輸出將存儲在文件夾hello / result中，可以通過執(zhí)行以下命令查看

hdfs dfs -cat hello/result/part-00000

計算年均黃金價格

下一個例子是一個更實際的例子，雖然數(shù)據(jù)集相對較小，但是相同的邏輯可以很容易地應(yīng)用于具有數(shù)百個數(shù)據(jù)點的集合上。我們將嘗試計算過去五十年的黃金年平均價格。

我們下載數(shù)據(jù)集：

wget https://raw.githubusercontent. ... a.csv

在HDFS（Hadoop分布式文件系統(tǒng)）中創(chuàng)建一個工作目錄

hadoop dfs -mkdir goldprice

將已下載的數(shù)據(jù)集復(fù)制到HDFS

hadoop dfs -copyFromLocal ./data.csv goldprice/data.csv

我的reducer看起來像這樣

#!/usr/bin/php<?php    // iterate through lines
    while($line = fgets(STDIN)){
        // remove leading and trailing
        $line = ltrim($line);
        $line = rtrim($line);
        // regular expression to capture year and gold value
        preg_match("/^(.*?)\-(?:.*),(.*)$/", $line, $matches);
        if ($matches) {
            // key: year, value: gold price
            printf("%s\t%.3f\n", $matches[1], $matches[2]);
        }
    }?>

reducer也略有修改，因為我們需要計算項目數(shù)量和平均值。

#!/usr/bin/php<?php
    $last_key = NULL;
    $running_total = 0;
    $running_average = 0;
    $number_of_items = 0;
    // iterate through lines
    while($line = fgets(STDIN)) {
        // remove leading and trailing
        $line = ltrim($line);
        $line = rtrim($line);
        // split line into key and count
        list($key,$count) = explode("\t", $line);
        // if the last key retrieved is the same
        // as the current key that have been received
        if ($last_key === $key) {
            // increase number of items
            $number_of_items++;
            // increase running total of the key
            $running_total += $count;
            // (re)calculate average for that key
            $running_average = $running_total / $number_of_items;
        } else {
            if ($last_key != NULL)
                // output previous key and its running average
                printf("%s\t%.4f\n", $last_key, $running_average);
            // reset key, running total, running average
            // and number of items
            $last_key = $key;
            $number_of_items = 1;
            $running_total   = $count;
            $running_average = $count;
        }
    }
    if ($last_key != NULL)
        // output previous key and its running average
        printf("%s\t%.3f\n", $last_key, $running_average);?>

像單詞統(tǒng)計樣例一樣，我們也可以在本地測試

head -n1000 data.csv | ./mapper.php | sort | ./reducer.php

最終在hadoop集群上運行它

hadoop jar /usr/hadoop/2.5.1/libexec/lib/hadoop-streaming-2.5.1.jar \ -mapper "./mapper.php"
 -reducer "./reducer.php"
 -input "goldprice/data.csv"
 -output "goldprice/result"

查看平均值

hdfs dfs -cat goldprice/result/part-00000

小獎勵：生成圖表

我們經(jīng)常會將結(jié)果轉(zhuǎn)換成圖表。對于這個演示，我將使用gnuplot，你可以使用其它任何有趣的東西。

首先在本地返回結(jié)果：

hdfs dfs -get goldprice/result/part-00000 gold.dat

創(chuàng)建一個gnu plot配置文件（gold.plot）并復(fù)制以下內(nèi)容

# Gnuplot script file for generating gold pricesset terminal pngset output "chart.jpg"set style data linesset nokeyset gridset title "Gold prices"set xlabel "Year"set ylabel "Price"plot "gold.dat"

生成圖表：

gnuplot gold.plot

感謝各位的閱讀！關(guān)于“怎么使用MapReduce”這篇文章就分享到這里了，希望以上內(nèi)容可以對大家有一定的幫助，讓大家可以學(xué)到更多知識，如果覺得文章不錯，可以把它分享出去讓更多的人看到吧！

本文標(biāo)題：怎么使用MapReduce-創(chuàng)新互聯(lián)
網(wǎng)站URL：http://aaarwkj.com/article24/cojcce.html

成都網(wǎng)站建設(shè)公司_創(chuàng)新互聯(lián)，為您提供網(wǎng)站排名、網(wǎng)站維護(hù)、手機網(wǎng)站建設(shè)、網(wǎng)站制作、外貿(mào)網(wǎng)站建設(shè)、商城網(wǎng)站

聲明：本網(wǎng)站發(fā)布的內(nèi)容（圖片、視頻和文字）以用戶投稿、用戶轉(zhuǎn)載內(nèi)容為主，如果涉及侵權(quán)請盡快告知，我們將會在第一時間刪除。文章觀點不代表本網(wǎng)站立場，如需處理請聯(lián)系客服。電話：028-86922220；郵箱：631063699@qq.com。內(nèi)容未經(jīng)允許不得轉(zhuǎn)載，或轉(zhuǎn)載時需注明來源：創(chuàng)新互聯(lián)

猜你還喜歡下面的內(nèi)容

欧美一级特黄大片做受成人-亚洲成人一区二区电影-激情熟女一区二区三区-日韩专区欧美专区国产专区

怎么使用MapReduce-創(chuàng)新互聯(lián)