Exploring US Bikeshare Data
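Project Overview
  Dataset
  Questions
  Questions to Answer
  Interactive Experience Required
Project Workflow
  Importing Libraries and the Dataset
  Building the Initial Prompt (City, Month, and Day Input)
  Loading Data Based on User Input
  Displaying the Most Frequent Times of Travel
  Displaying the Most Popular Stations and Trip (Start Station to End Station)
  Displaying Total and Average Trip Duration
  Displaying User Types, Gender, and Birth Year Statistics
  Setting Up the Main Function
How to Download and Use
References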
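Project Overview
In this project, you will use Python to explore data from the bike share systems of three major US cities: Chicago, New York City, and Washington, DC. You will write code to import the data and answer interesting questions about it by computing descriptive statistics. You will also write a script that takes raw input and creates an interactive experience in the terminal to present these statistics.
Dataset
Data for the first six months of 2017 is provided for all three cities. All three data files contain the same six (6) core columns: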
- Start Time (e.g. 2017-01-01 00:07:57)
- End Time (e.g. 2017-01-01 00:20:53)
- Trip Duration (in seconds, e.g. 776)
- Start Station (e.g. Broadway & Barry Ave)
- End Station (e.g. Sedgwick St & North Ave)
- User Type (Subscriber/Registered or Customer/Casual)
The Chicago and New York City files also contain the following two columns (see the image below for the data format):
- Gender
- Birth Year
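Questions
Questions to Answer
You will write code to answer the following questions about the bike share data:
- Which month is most common in the start times (Start Time column)?
- Which day of the week (e.g. Monday, Tuesday) is most common in the start times? Hint: you can use datetime.weekday().
- Which hour of the day is most common in the start times?
- What is the total trip duration (Trip Duration), and what is the average trip duration?
- Which start station (Start Station) is most popular, and which end station (End Station) is most popular?
- Which trip is most popular (that is, which combination of start station and end station occurs most often)?
- How many users are there of each user type?
- How many users are there of each gender?
- What are the earliest, most recent, and most common birth years?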
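As a rough illustration of the datetime.weekday() hint, the sketch below finds the most common month, weekday, and hour using only the standard library. The four timestamps are a made-up sample in the dataset's format, and the names most_common and DAY_NAMES are hypothetical helpers, not part of the project code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of "Start Time" strings in the dataset's format.
start_times = [
    "2017-01-01 00:07:57",
    "2017-01-02 08:15:00",
    "2017-02-06 08:45:10",
    "2017-01-09 17:30:00",
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def most_common(values):
    """Return the most frequent value (the first one seen wins on ties)."""
    return Counter(values).most_common(1)[0][0]


parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in start_times]

common_month = most_common(dt.month for dt in parsed)
common_day = DAY_NAMES[most_common(dt.weekday() for dt in parsed)]
common_hour = most_common(dt.hour for dt in parsed)
print(common_month, common_day, common_hour)  # prints: 1 Monday 8
```

The project code later in this post does the same thing with pandas, which is far more convenient on the full CSV files.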
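Interactive Experience Required
The final file should be a script that takes raw input and creates an interactive experience in the terminal (for example, cmd on Windows) to answer questions about the dataset. The experience is interactive because the results shown next change depending on what the user enters (implemented with input()).
The following three questions affect the results:
- Which city's data do you want to analyze? Input: Chicago, New York, Washington (Would you like to see data for Chicago, New York, or Washington?)
- Which month's data do you want to analyze? Input: all, January, February, ..., June (Which month? all, january, february, ... , june?)
- Which day of the week do you want to analyze? Input: all, Monday, Tuesday, ..., Sunday (Which day? all, monday, tuesday, ... sunday?)
The answers to these questions determine which city is analyzed and whether the data is filtered by month or by day of the week. Once the dataset has been filtered and loaded, the user sees the statistics and can choose to restart or exit. Input should be case-insensitive: "Chicago", "CHICAGO", "chicago", and "chiCago" are all valid. You can use string methods such as lower(), upper(), and title() to process the input.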
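One possible way to handle case-insensitive input is sketched below. The helper names normalize_choice and prompt_until_valid, and the input_fn parameter (which lets the loop be exercised without a keyboard), are assumptions made for this illustration; the project code in the next section uses its own helper.

```python
def normalize_choice(raw, valid_choices):
    """Return the stripped, lowercased input if it is valid, else None."""
    cleaned = raw.strip().lower()
    return cleaned if cleaned in valid_choices else None


def prompt_until_valid(prompt, valid_choices, input_fn=input):
    """Keep asking until the user enters one of valid_choices."""
    while True:
        choice = normalize_choice(input_fn(prompt), valid_choices)
        if choice is not None:
            return choice
        print("Invalid input, please try again.")


CITIES = ["chicago", "new york city", "washington"]

# All four spellings from the text normalize to the same key.
for text in ["Chicago", "CHICAGO", "chicago", "chiCago"]:
    print(text, "->", normalize_choice(text, CITIES))
```

Returning the normalized value (rather than the raw text) means later code can use it directly as a dictionary key.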
Project Workflow
Importing Libraries and the Dataset
import time
import pandas as pd
import numpy as np

# Map each city name to its CSV file.
CITY_DATA = {'chicago': 'chicago.csv',
             'new york city': 'new_york_city.csv',
             'washington': 'washington.csv'}
Building the Initial Prompt (City, Month, and Day Input)
def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print('\nHello! Let\'s explore some US bikeshare data!')

    def input_mod(input_print, enterable_list):
        # Re-prompt until the lowercased input is one of the accepted values.
        ret = input(input_print)
        while ret.lower() not in enterable_list:
            ret = input(input_print)
        return ret

    city = input_mod('\nPlease input the name of the city you want to analyze: Chicago, New York City, Washington or all!\n',
                     list(CITY_DATA.keys()) + ['all'])
    month = input_mod('\nPlease input the month you want to analyze: all, january, february, ... , june!\n',
                      ['january', 'february', 'march', 'april', 'may', 'june', 'all'])
    day = input_mod('\nPlease input the day-of-week you want to analyze: all, monday, tuesday, ... sunday!\n',
                    ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'all'])

    print('-'*40)
    return city, month, day
Loading Data Based on User Input
def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze, or "all" to combine all three cities
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    if city.lower() == 'all':
        # Combine the three city files. DataFrame.append was removed in
        # pandas 2.0, so pd.concat is used instead.
        df = pd.concat([pd.read_csv(path) for path in CITY_DATA.values()],
                       ignore_index=True, sort=False)
    else:
        df = pd.read_csv(CITY_DATA[city.lower()])

    # Parse the start time and derive the columns used for filtering.
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['month'] = df['Start Time'].dt.month
    # .dt.weekday_name was removed in pandas 1.0; .dt.day_name() replaces it.
    df['day_of_week'] = df['Start Time'].dt.day_name()

    if month.lower() != 'all':
        # Convert the month name to its number (january -> 1).
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month.lower()) + 1
        df = df[df['month'] == month]

    if day.lower() != 'all':
        df = df[df['day_of_week'] == day.title()]

    return df
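To see the month and day filters in isolation, here is a small sketch on a made-up three-row DataFrame (the rows are hypothetical, and the real CSV files contain more columns). It assumes a pandas version that provides Series.dt.day_name().

```python
import pandas as pd

# Hypothetical three-row sample; the real CSV files have more columns.
df = pd.DataFrame({
    "Start Time": ["2017-01-02 08:15:00",
                   "2017-02-06 08:45:10",
                   "2017-02-07 17:30:00"],
    "Trip Duration": [776, 300, 1200],
})
df["Start Time"] = pd.to_datetime(df["Start Time"])
df["month"] = df["Start Time"].dt.month
df["day_of_week"] = df["Start Time"].dt.day_name()

# Boolean masks keep only the rows that match each filter.
february = df[df["month"] == 2]
february_mondays = february[february["day_of_week"] == "Monday"]
print(len(february), len(february_mondays))  # prints: 2 1
```

Because each filter returns a new DataFrame, the month and day filters can be applied one after the other in either order.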
Displaying the Most Frequent Times of Travel
def time_stats(df):
    """Displays statistics on the most frequent times of travel."""
    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()

    # Most common month (as a number, 1 = January).
    popular_month = df['month'].mode()[0]
    print('The most common month is:', popular_month)

    # Most common day of the week.
    popular_day = df['day_of_week'].mode()[0]
    print('The most common day of week is:', popular_day)

    # Most common start hour (0-23).
    df['start hour'] = df['Start Time'].dt.hour
    popular_start_hour = df['start hour'].mode()[0]
    print('The most common start hour is:', popular_start_hour)

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
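One detail worth knowing about mode()[0]: Series.mode() returns every tied value, sorted in ascending order, so indexing with [0] quietly picks the smallest one. The sketch below shows this on a made-up Series and also converts the month number to a name with the standard calendar module, which is an optional extra rather than something the project code does.

```python
import calendar

import pandas as pd

months = pd.Series([3, 1, 3, 1, 2])  # months 1 and 3 both appear twice
modes = months.mode()                # all tied values, sorted ascending
top_month = int(modes[0])            # [0] picks the smallest tied value

print(list(modes))                     # prints: [1, 3]
print(calendar.month_name[top_month])  # prints: January
```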
Displaying the Most Popular Stations and Trip (Start Station to End Station)
def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()

    popular_start_station = df['Start Station'].mode()[0]
    print('The most common start station is:', popular_start_station)

    popular_end_station = df['End Station'].mode()[0]
    print('The most common end station is:', popular_end_station)

    # Count each (start, end) pair and take the pair with the highest count.
    top = df.groupby(['Start Station', 'End Station']).size().idxmax()
    print("The most frequent combination of start station and end station trip is '{}' to '{}'".format(top[0], top[1]))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
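The groupby(...).size().idxmax() chain can be hard to picture, so here is a minimal sketch on a made-up table of four trips (the station names A, B, and C are placeholders). Grouping by two columns produces a Series indexed by (start, end) tuples, and idxmax() returns the tuple with the largest count.

```python
import pandas as pd

# Hypothetical trips between placeholder stations A, B and C.
trips = pd.DataFrame({
    "Start Station": ["A", "A", "B", "A"],
    "End Station": ["B", "B", "A", "C"],
})

# One row per distinct (start, end) pair, with the number of trips as value.
pair_counts = trips.groupby(["Start Station", "End Station"]).size()
start, end = pair_counts.idxmax()
print(start, end, pair_counts.max())  # prints: A B 2
```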
Displaying Total and Average Trip Duration
def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""
    print('\nCalculating Trip Duration...\n')
    start_time = time.time()

    # Trip Duration is recorded in seconds (e.g. 776), not minutes.
    total_time = df['Trip Duration'].sum()
    print('The total travel time is:', total_time, 'seconds.')

    mean_time = df['Trip Duration'].mean()
    print('The mean travel time is:', mean_time, 'seconds.')

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
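Since Trip Duration is stored in seconds, large totals are easier to read when converted to hours, minutes, and seconds. The helper below is a hypothetical addition (format_duration is not part of the project code) that does the conversion with divmod.

```python
def format_duration(total_seconds):
    """Format a duration given in seconds as 'Hh MMm SSs'."""
    total_seconds = int(round(total_seconds))
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


print(format_duration(776))     # prints: 0h 12m 56s
print(format_duration(3661.4))  # prints: 1h 01m 01s
```

It could be applied to both the sum and the mean before printing them.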
Displaying User Types, Gender, and Birth Year Statistics
def user_stats(df):
    """Displays statistics on bikeshare users."""
    print('\nCalculating User Stats...\n')
    start_time = time.time()

    user_types = df['User Type'].value_counts()
    print('The counts of user types are:', '\n', user_types)

    # Drop rows with missing values before the gender and birth year stats.
    df = df.dropna()

    # The Washington file has no Gender or Birth Year column,
    # so a missing column raises KeyError.
    try:
        gender = df['Gender'].value_counts()
        print('\nThe counts of gender are:', '\n', gender)
    except KeyError:
        print('\nSorry, there\'s no \'Gender\' data to analyze.')

    try:
        earliest_birth = df['Birth Year'].min()
        most_recent_birth = df['Birth Year'].max()
        most_common_year = df['Birth Year'].mode()[0]
        print('\nThe earliest year of birth is:', earliest_birth)
        print('The most recent year of birth is:', most_recent_birth)
        print('The most common year of birth is:', most_common_year)
    except KeyError:
        print('\nSorry, there\'s no \'Birth Year\' data to analyze.')

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
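An alternative to catching exceptions is to check for the optional column up front. The sketch below assumes a hypothetical helper named birth_year_summary and two made-up DataFrames; it is one possible design, not the approach taken in user_stats.

```python
import pandas as pd


def birth_year_summary(df):
    """Return (earliest, most recent, most common) birth year, or None."""
    if "Birth Year" not in df.columns:
        return None
    years = df["Birth Year"].dropna()
    if years.empty:
        return None
    return int(years.min()), int(years.max()), int(years.mode()[0])


# Hypothetical data: one frame with birth years, one without the column.
with_years = pd.DataFrame({"Birth Year": [1985.0, 1992.0, None, 1992.0]})
without_years = pd.DataFrame({"User Type": ["Subscriber", "Customer"]})

print(birth_year_summary(with_years))     # prints: (1985, 1992, 1992)
print(birth_year_summary(without_years))  # prints: None
```

Dropping missing values only in the column being summarized also avoids discarding rows that are complete in every other respect.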
Setting Up the Main Function
def main():
    while True:
        # Ask for the filters, load the data, then print every statistic.
        city, month, day = get_filters()
        df = load_data(city, month, day)

        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)

        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.lower() != 'yes':
            break


if __name__ == "__main__":
    main()
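How to Download and Use
Code download link: my Baidu Netdisk (extraction code: ur0i)
Required environment: Python 3
Running in the terminal: on Windows, open cmd, change to the directory where the file is stored, run ipython bikeshare.py, and follow the on-screen prompts.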
References
Reference 1: https://blog.csdn.net/milton2017/article/details/54406482/
Reference 2: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html