LAB2

This lab covers downloading data, using the `pathlib` library, and the logic for caching downloaded files.
```python
import requests
from pathlib import Path

def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path to the file.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time
        created = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded at", created)
    return file_path
```
In Python, a `Path` object represents a filesystem path to a file or directory. The `pathlib` module helps you write code that works across operating systems and filesystems. To check whether a file exists at a path, use `.exists()`. To create a directory at a path, use `.mkdir()`. To remove a file (or a [symbolic link](https://en.wikipedia.org/wiki/Symbolic_link)), use `.unlink()`.

This function builds a path to a directory that will contain data files. It ensures that the directory exists (which is required to write files in that directory), then downloads the file from its URL. The benefit is twofold: the `force` parameter lets you request a fresh download whenever you want one, and when no re-download is needed the cached copy is reused, which saves download time.
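To see the three `Path` methods in isolation, here is a minimal sketch that runs without any network access. The directory name `data` and file name `example.txt` are placeholders chosen for illustration, and everything happens inside a temporary directory that is cleaned up afterwards.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"          # hypothetical directory name
    data_dir.mkdir(exist_ok=True)          # no error if it already exists
    file_path = data_dir / "example.txt"   # hypothetical file name

    existed_before = file_path.exists()    # nothing has been written yet
    file_path.write_text("cached contents")
    existed_after = file_path.exists()     # the file is now on disk

    file_path.unlink()                     # delete the file
    existed_at_end = file_path.exists()    # gone again

print(existed_before, existed_after, existed_at_end)
```

The `/` operator joins path components, which is the same idiom `fetch_and_cache` uses when it builds `file_path`.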
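```python
import zipfile
import pandas as pd

# namesbystate_path is defined earlier (not shown in these notes)
zf = zipfile.ZipFile(namesbystate_path, 'r')

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    # Read one member of the archive; the files have no header row
    with zf.open(f) as fh:
        return pd.read_csv(fh, header=None, names=column_labels)

# One DataFrame per .TXT file, ordered by filename
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x: x.filename)
    if f.filename.endswith('.TXT')
]

# Stack the per-file tables into a single DataFrame
baby_names = states[0]
for state_df in states[1:]:
    baby_names = pd.concat([baby_names, state_df])

# Rebuild a fresh 0..n-1 index and drop the old index column
baby_names = baby_names.reset_index().iloc[:, 1:]
```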
Note the logic used here to merge multiple files into one table.
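The same merge pattern can be tried on a small in-memory archive. The sketch below is an illustration under assumptions, not the lab's own code: the member names `AA.TXT` and `BB.TXT` and their rows are invented toy data, and it passes the whole list to `pd.concat` in a single call with `ignore_index=True` instead of concatenating in a loop.

```python
import io
import zipfile

import pandas as pd

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

# Build a tiny zip archive in memory with two made-up data files (toy data only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf_out:
    zf_out.writestr('AA.TXT', 'AA,F,2000,Ann,5\nAA,M,2000,Bob,7\n')
    zf_out.writestr('BB.TXT', 'BB,F,2001,Cat,3\n')
    zf_out.writestr('README.pdf', 'not a data file')  # skipped by the filter below

# Read every .TXT member into its own DataFrame, in filename order.
with zipfile.ZipFile(buffer, 'r') as zf_in:
    frames = []
    for name in sorted(zf_in.namelist()):
        if name.endswith('.TXT'):
            with zf_in.open(name) as fh:
                frames.append(pd.read_csv(fh, header=None, names=column_labels))

# Concatenate all frames at once; ignore_index=True renumbers rows from 0.
toy_names = pd.concat(frames, ignore_index=True)
print(toy_names)
```

Concatenating once avoids copying the growing table on every loop iteration, which matters when there are many files.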
 
 
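Use `query` to filter data. Inside the query string, column names do not need quotes. To refer to a previously defined Python variable, prefix its name with `@`.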
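A minimal sketch of both points, using a small invented table in place of the real dataset. The column names `Year`, `Party`, and `Pct`, the values, and the variable `min_year` are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in table; these rows are invented for the example.
toy_elections = pd.DataFrame({
    "Year": [1976, 1980, 1984, 1988],
    "Party": ["X", "Y", "X", "Y"],
    "Pct": [40.0, 51.0, 47.5, 53.0],
})

min_year = 1980  # an ordinary Python variable

# Column names appear bare; @min_year pulls in the Python variable.
recent = toy_elections.query("Year >= @min_year")

# String values inside the query still need their own quotes.
recent_x = toy_elections.query("Year >= @min_year and Party == 'X'")

print(recent)
print(recent_x)
```

Only column names go unquoted: a string literal being compared against, such as `'X'` above, still needs quotes inside the query string.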
 
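In a notebook, you can call `display` to show any object. The result is rendered the same way as when a variable is the last expression in a cell, which makes it handy inside loops or other code blocks. For example: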
```python
for n, g in elections.query("Year >= 1980").groupby("Party"):
    print(f"Name: {n}")  # by the way this is an "f string", a relatively new and great feature of Python
    display(g)
```
 
 
pandas code often uses method chaining, which tends to produce very long lines, so chains are usually written like this:
```python
(
    elections.query("Year >= 1980").groupby("Party")
    .mean()          # computes the mean values by party
    .reset_index()   # reset to a numerical index
)

# Or split the chain across lines with backslashes
elections.query("Year >= 1980").groupby("Party") \
    .mean() \
    .reset_index()
```
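For a version that runs on its own, here is a sketch of the parenthesized style on an invented table. The frame `toy`, its column names, and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data invented for the example.
toy = pd.DataFrame({
    "Year": [1976, 1980, 1984, 1988, 1992],
    "Party": ["X", "Y", "X", "Y", "X"],
    "Pct": [40.0, 51.0, 47.0, 53.0, 43.0],
})

summary = (
    toy.query("Year >= 1980")   # keep rows from 1980 onward
    .groupby("Party")           # one group per party
    .mean()                     # mean of each remaining column per group
    .reset_index()              # turn Party back into a regular column
)
print(summary)
```

One practical reason to prefer parentheses: each step can carry its own trailing comment, whereas a backslash must be the last character on its line, so comments cannot follow it.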