使用窗口函数分析数据

本教程简要介绍 Drill 1.2 的分析,即分析窗口函数的 ANSI SQL 标准。Drill 支持以下窗口函数:

  • PARTITION BY 和 OVER 语法
  • 不同种类的聚合函数,如 Sum,Max,Min,Count,Avg
  • 解析函数,例如 First_Value, Last_Value, Lead, Lag, NTile, Row_Number 和 Rank

窗口函数是多功能的。你可以减少连接,子查询,和显示游标,你只需要写而已。窗口函数以最小的编码工作,解决了各种复杂的情况。

本教程建立在前面的教程之上,A-Y-A-D 数据分析高度动态的数据集分析,并且使用的是相同的 Yelp 数据集。

开始

  1. 在开始前,下载 Yelp(商业评论)。
  2. 安装并启动 Drill
  3. 在 Drill 中列出可用的 Schema。

    SHOW schemas;
    +---------------------+
    |     SCHEMA_NAME     |
    +---------------------+
    | INFORMATION_SCHEMA  |
    | cp.default          |
    | dfs.default         |
    | dfs.root            |
    | dfs.tmp             |
    | dfs.yelp            |
    | sys                 |
    +---------------------+
    
    7 rows selected (1.755 seconds)
    
  4. 切换工作目录。

    USE dfs.yelp;
    
    +-------+---------------------------------------+
    |  ok   |                summary                |
    +-------+---------------------------------------+
    | true  | Default schema changed to [dfs.yelp]  |
    +-------+---------------------------------------+
    
    1 row selected (0.129 seconds)
    
  5. 开始探索 Yelp 中可用的数据集信息。

    SELECT * FROM `business.json` LIMIT 1;
    
    +------------------------+-----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
    | business_id            | full_address                                        | hours                                                                                                                                                                                                                                               | open |         categories             |   city  | review_count |        name       |  longitude  | state | stars |  latitude | attributes                                                                                                                                          |    type  | neighborhoods |
    +------------------------+--------------+------+-------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
    | vcNAWiLM4dR7D2nwwJ7nCA | 4840 E Indian School Rd Ste 101 Phoenix, AZ 85018   | {"Tuesday":{"close":"17:00","open":"08:00"},"Friday":{"close":"17:00","open":"08:00"},"Monday":{"close":"17:00","open":"08:00"},"Wednesday":{"close":"17:00","open":"08:00"},"Thursday":{"close":"17:00","open":"08:00"},"Sunday":{},"Saturday":{}} | true | ["Doctors","Health & Medical"] | Phoenix |      7       | Eric Goldberg, MD | -111.983758 |   AZ  |  3.5  | 33.499313 | {"By Appointment Only":true,"Good Ambience":{},"Parking":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} | business |      []       |
    +-------------+--------------+-------+------+------------+------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------------+---------+--------------+-------------------+-------------+-------+-------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+
    1 row selected (0.514 seconds)
    

使用窗口函数做简单的查询

  1. 获取 Yelp 的业务基础评论数据。

    SELECT name, city, review_count, row_number()
    OVER (PARTITION BY city ORDER BY review_count DESC) AS rownum
    FROM `business.json` LIMIT 15;  
    
    +----------------------------------------+------------+---------------+---------+
    |                  name                  |    city    | review_count  | rownum  |
    +----------------------------------------+------------+---------------+---------+
    | Cupz N' Crepes                         | Ahwatukee  | 124           | 1       |
    | My Wine Cellar                         | Ahwatukee  | 98            | 2       |
    | Kathy's Alterations                    | Ahwatukee  | 12            | 3       |
    | McDonald's                             | Ahwatukee  | 7             | 4       |
    | U-Haul                                 | Ahwatukee  | 5             | 5       |
    | Hi-Health                              | Ahwatukee  | 4             | 6       |
    | Healthy and Clean Living Environments  | Ahwatukee  | 4             | 7       |
    | Active Kids Pediatrics                 | Ahwatukee  | 4             | 8       |
    | Roberto's Authentic Mexican Food       | Anthem     | 117           | 1       |
    | Q to U BBQ                             | Anthem     | 74            | 2       |
    | Outlets At Anthem                      | Anthem     | 64            | 3       |
    | Dara Thai                              | Anthem     | 56            | 4       |
    | Cafe Provence                          | Anthem     | 53            | 5       |
    | Shanghai Club                          | Anthem     | 50            | 6       |
    | Two Brothers Kitchen                   | Anthem     | 43            | 7       |
    +----------------------------------------+------------+---------------+---------+
    15 rows selected (0.67 seconds)
    
  2. 检查每个业务的数量相比在城市的所有业务的平均数量的评论。

    SELECT name, city,review_count,
    Avg(review_count) OVER (PARTITION BY City) AS city_reviews_avg
    FROM `business.json` LIMIT 15;
    
    +----------------------------------------+------------+---------------+---------------------+
    |                  name                  |    city    | review_count  |  city_reviews_avg   |
    +----------------------------------------+------------+---------------+---------------------+
    | Hi-Health                              | Ahwatukee  | 4             | 32.25               |
    | My Wine Cellar                         | Ahwatukee  | 98            | 32.25               |
    | U-Haul                                 | Ahwatukee  | 5             | 32.25               |
    | Cupz N' Crepes                         | Ahwatukee  | 124           | 32.25               |
    | McDonald's                             | Ahwatukee  | 7             | 32.25               |
    | Kathy's Alterations                    | Ahwatukee  | 12            | 32.25               |
    | Healthy and Clean Living Environments  | Ahwatukee  | 4             | 32.25               |
    | Active Kids Pediatrics                 | Ahwatukee  | 4             | 32.25               |
    | Anthem Community Center                | Anthem     | 4             | 14.492063492063492  |
    | Scrapbooks To Remember                 | Anthem     | 4             | 14.492063492063492  |
    | Hungry Howie's Pizza                   | Anthem     | 7             | 14.492063492063492  |
    | Pinata Nueva                           | Anthem     | 3             | 14.492063492063492  |
    | Starbucks Coffee Company               | Anthem     | 13            | 14.492063492063492  |
    | Pizza Hut                              | Anthem     | 6             | 14.492063492063492  |
    | Rays Pizza                             | Anthem     | 19            | 14.492063492063492  |
    +----------------------------------------+------------+---------------+---------------------+
    15 rows selected (0.395 seconds)
    
  3. 检查每个企业的评论数量为城市的所有业务的总数量的贡献。

    SELECT name, city,review_count,
    Sum(review_count) OVER (PARTITION BY City) AS city_reviews_sum
    FROM `business.json`limit 15;
    
    +----------------------------------------+------------+---------------+-------------------+
    |                  name                  |    city    | review_count  | city_reviews_sum  |
    +----------------------------------------+------------+---------------+-------------------+
    | Hi-Health                              | Ahwatukee  | 4             | 258               |
    | My Wine Cellar                         | Ahwatukee  | 98            | 258               |
    | U-Haul                                 | Ahwatukee  | 5             | 258               |
    | Cupz N' Crepes                         | Ahwatukee  | 124           | 258               |
    | McDonald's                             | Ahwatukee  | 7             | 258               |
    | Kathy's Alterations                    | Ahwatukee  | 12            | 258               |
    | Healthy and Clean Living Environments  | Ahwatukee  | 4             | 258               |
    | Active Kids Pediatrics                 | Ahwatukee  | 4             | 258               |
    | Anthem Community Center                | Anthem     | 4             | 913               |
    | Scrapbooks To Remember                 | Anthem     | 4             | 913               |
    | Hungry Howie's Pizza                   | Anthem     | 7             | 913               |
    | Pinata Nueva                           | Anthem     | 3             | 913               |
    | Starbucks Coffee Company               | Anthem     | 13            | 913               |
    | Pizza Hut                              | Anthem     | 6             | 913               |
    | Rays Pizza                             | Anthem     | 19            | 913               |
    +----------------------------------------+------------+---------------+-------------------+
    15 rows selected (0.543 seconds)
    

使用窗口函数进行复杂查询

  1. 排名前 10 名的城市和他们的排名最高的企业数量的评论。使用 Drill 窗口函数,例如 rank,dense_sank。

    WITH X
    AS
    (SELECT name, city, review_count,
    RANK()
    OVER (PARTITION BY city
    ORDER BY review_count DESC) AS review_rank
    FROM `business.json`)
    SELECT X.name, X.city, X.review_count
    FROM X
    WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;
    
    +-------------------------------------------+-------------+---------------+
    |                   name                    |    city     | review_count  |
    +-------------------------------------------+-------------+---------------+
    | Mon Ami Gabi                              | Las Vegas   | 4084          |
    | Studio B                                  | Henderson   | 1336          |
    | Phoenix Sky Harbor International Airport  | Phoenix     | 1325          |
    | Four Peaks Brewing Co                     | Tempe       | 1110          |
    | The Mission                               | Scottsdale  | 783           |
    | Joe's Farm Grill                          | Gilbert     | 770           |
    | The Old Fashioned                         | Madison     | 619           |
    | Cornish Pasty Company                     | Mesa        | 578           |
    | SanTan Brewing Company                    | Chandler    | 469           |
    | Yard House                                | Glendale    | 321           |
    +-------------------------------------------+-------------+---------------+
    10 rows selected (0.49 seconds)
    
  2. 在城市的顶部和底部的评论计数的每个业务的评论数量比较。

    SELECT name, city, review_count,
    FIRST_VALUE(review_count)
    OVER(PARTITION BY city ORDER BY review_count DESC) AS top_review_count,
    LAST_VALUE(review_count)
    OVER(PARTITION BY city ORDER BY review_count DESC) AS bottom_review_count
    FROM `business.json` limit 15;
    
    +----------------------------------------+------------+---------------+-------------------+----------------------+
    |                  name                  |    city    | review_count  | top_review_count  | bottom_review_count  |
    +----------------------------------------+------------+---------------+-------------------+----------------------+
    | My Wine Cellar                         | Ahwatukee  | 98            | 124               | 12                   |
    | McDonald's                             | Ahwatukee  | 7             | 124               | 12                   |
    | U-Haul                                 | Ahwatukee  | 5             | 124               | 12                   |
    | Hi-Health                              | Ahwatukee  | 4             | 124               | 12                   |
    | Healthy and Clean Living Environments  | Ahwatukee  | 4             | 124               | 12                   |
    | Active Kids Pediatrics                 | Ahwatukee  | 4             | 124               | 12                   |
    | Cupz N' Crepes                         | Ahwatukee  | 124           | 124               | 12                   |
    | Kathy's Alterations                    | Ahwatukee  | 12            | 124               | 12                   |
    | Q to U BBQ                             | Anthem     | 74            | 117               | 117                  |
    | Dara Thai                              | Anthem     | 56            | 117               | 117                  |
    | Cafe Provence                          | Anthem     | 53            | 117               | 117                  |
    | Shanghai Club                          | Anthem     | 50            | 117               | 117                  |
    | Two Brothers Kitchen                   | Anthem     | 43            | 117               | 117                  |
    | The Tennessee Grill                    | Anthem     | 32            | 117               | 117                  |
    | Dollyrockers Boutique and Salon        | Anthem     | 30            | 117               | 117                  |
    +----------------------------------------+------------+---------------+-------------------+----------------------+
    15 rows selected (0.516 seconds)
    
  3. 比较前后的业务评论数量。

    SELECT city, review_count, name,
    LAG(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC)
    AS preceding_count,
    LEAD(review_count, 1) OVER(PARTITION BY city ORDER BY review_count DESC)
    AS following_count
    FROM `business.json` limit 15;
    
    +------------+---------------+----------------------------------------+------------------+------------------+
    |    city    | review_count  |                  name                  | preceding_count  | following_count  |
    +------------+---------------+----------------------------------------+------------------+------------------+
    | Ahwatukee  | 124           | Cupz N' Crepes                         | null             | 98               |
    | Ahwatukee  | 98            | My Wine Cellar                         | 124              | 12               |
    | Ahwatukee  | 12            | Kathy's Alterations                    | 98               | 7                |
    | Ahwatukee  | 7             | McDonald's                             | 12               | 5                |
    | Ahwatukee  | 5             | U-Haul                                 | 7                | 4                |
    | Ahwatukee  | 4             | Hi-Health                              | 5                | 4                |
    | Ahwatukee  | 4             | Healthy and Clean Living Environments  | 4                | 4                |
    | Ahwatukee  | 4             | Active Kids Pediatrics                 | 4                | null             |
    | Anthem     | 117           | Roberto's Authentic Mexican Food       | null             | 74               |
    | Anthem     | 74            | Q to U BBQ                             | 117              | 64               |
    | Anthem     | 64            | Outlets At Anthem                      | 74               | 56               |
    | Anthem     | 56            | Dara Thai                              | 64               | 53               |
    | Anthem     | 53            | Cafe Provence                          | 56               | 50               |
    | Anthem     | 50            | Shanghai Club                          | 53               | 43               |
    | Anthem     | 43            | Two Brothers Kitchen                   | 50               | 32               |
    +------------+---------------+----------------------------------------+------------------+------------------+
    15 rows selected (0.518 seconds)
    
Copyright © smartloli 2016 all right reserved,powered by GitbookModify: 2016-08-26 06:59:28