pd.Series(np.random.random(4), index=['a', 'b', 'c', 'd'] , name="uniform draws")
Series 可以通过下标和label访问,可以通过字典初始化
s3 = pd.Series({"lions":2, "tigers":1, "bears":3}, name="oh my")
s3[1:]
s3[np.array([len(i) == 5 for i in s3.index])]
s3[ ["tigers", "bears"] ]
Series 之间做运算得到的结果index 是二者的并集,也就是说index决定了两个series如何运算
Method | Returns |
---|---|
abs() | Object with absolute values taken (of numerical data) |
argmax() | The index label of the maximum value |
argmin() | The index label of the minimum value |
count() | The number of non-null entries |
cumprod() | The cumulative product over an axis |
cumsum() | The cumulative sum over an axis |
max() | The maximum of the entries |
mean() | The average of the entries |
median() | The median of the entries |
min() | The minimum of the entries |
mode() | The most common element(s) |
prod() | The product of the elements |
sum() | The sum of the elements |
var() | The variance of the elements |
pd.date_range("7/1/2000", "7/3/2000", freq='D')
DatetimeIndex(['2000-07-01', '2000-07-02', '2000-07-03'],
dtype='datetime64[ns]', freq='D')
Method | Description |
---|---|
append() | Concatenate two or more Series. |
drop() | Remove the entries with the specified label or labels |
drop_duplicates() | Remove duplicate values |
dropna() | Drop null entries |
fillna() | Replace null entries with a specified value or strategy |
reindex() | Replace the index |
sample() | Draw a random entry |
shift() | Shift the index |
unique() | Return unique values |
Dataframe
Dataframe是series的集合,一列就是一个series
df1 = pd.DataFrame({"series 1": x, "series 2": y})
df1["series1"].dropna() # recovers a series
>>> data = np.random.random((3, 4))
>>> pd.DataFrame(data, index=['A', 'B', 'C'], columns=np.arange(1, 5))
1 2 3 4
A 0.065646 0.968593 0.593394 0.750110
B 0.803829 0.662237 0.200592 0.137713
C 0.288801 0.956662 0.817915 0.951016
3 rows 4 columns
可以行切片,可以使用loc(label)和iloc(index)
df = pd.DataFrame(np.random.randn(4, 2), index=['a', 'b', 'c', 'd'],columns = ['I', 'II'])
df[:2]
I II
a 0.758867 1.231330
b 0.402484 -0.955039
>>> # select rows a and c, column II
>>> df.loc[['a','c'], 'II']
a 1.231330
c 0.556121
Name: II, dtype: float64
>>> # select last two rows, first column
>>> df.iloc[-2:, 0]
c -0.475952
d -0.518989
Name: I, dtype: float64
df[“Ⅱ”] df.Ⅱ也可以
>>> studentInfo = pd.DataFrame({'ID': ID, 'Name': name, 'Sex': sex, 'Age': age, 'Class': rank})
>>> otherInfo = pd.DataFrame({'ID': ID, 'GPA': GPA, 'Financial_Aid': aid})
>>> mathInfo = pd.DataFrame({'ID': mathID, 'Grade': mathGd, 'Math_Major': major })
isin() is a useful way to find certain values in a DataFrame. It compares the input (a list, dictionary, or Series) to the DataFrame and returns a boolean Series showing whether or not the values match. You can then use this boolean array to select appropriate locations.
>>> # SELECT ID, Age FROM studentInfo
>>> studentInfo[['ID', 'Age']]
>>> # SELECT ID, GPA FROM otherInfo WHERE Financial_Aid = 'y'
>>> otherInfo[otherInfo['Financial_Aid']=='y'][['ID', 'GPA']]
>>> # SELECT Name FROM studentInfo WHERE Class = 'J' OR Class = 'Sp'
>>> studentInfo[studentInfo['Class'].isin(['J','Sp'])]['Name']
# SELECT * FROM studentInfo INNER JOIN mathInfo ON studentInfo.ID = -
mathInfo.ID
>>> pd.merge(studentInfo, mathInfo, on='ID') # INNER JOIN is the default
Analyzing Data
可以直接np.log(z)
(x + y).dropna() 去掉NaN
x.fillna(0) 将NaN填充0
Data I/O
read_csv()
• delimiter: This argument specifies the character that separates data fields, often a comma ora whitespace character.
• header: The row number (starting at 0) in the CSV file that contains the column names.
• index_col: If you want to use one of the columns in the CSV file as the index for the DataFrame,set this argument to the desired column number.
• skiprows: If an integer n, skip the first n rows of the file, and then start reading in the data.If a list of integers, skip the specified rows.
• names: If the CSV file does not contain the column names, or you wish to use other columnnames, specify them in a list assigned to this argument.
df.to_csv(“my_df.csv”)