
atomic introduce autodiff

hgratten requested to merge atomic_autodiff into master

this merge request introduces the unfinished autodiff assignment in Python. see the added readme.md (copied below):

todo for autodiff

this is an unfinished, work-in-progress assignment covering the automatic differentiation technique. this document gives an overview of the next steps. for the theory of differentiation, consult the lecture document.

finish proof-of-concept in python

() assignment part: introduce optimization via gradient descent instead of Newton's method

let the student implement gradient descent using a simple adaptive step size scheme. explain the necessity of using an adaptive step size. then compare both methods w.r.t. overall optimization time and number of steps.

solution: individual gradient descent steps are cheaper, but overall optimization takes much longer (with such a simple adaptive scheme). this is because Newton steps on a quadratic landscape are much smarter: once the landscape near the optimum is almost quadratic, Newton's method converges almost instantly.
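a minimal sketch of what the comparison could look like, not the assignment's actual skeleton: the 2D quadratic, the function names, and the halve-on-failure / grow-on-success step size rule below are all illustrative assumptions.

```python
import numpy as np

# quadratic landscape f(x) = 0.5 * x^T A x - b^T x (illustrative choice)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

def hess(x):
    return A

def gradient_descent(x, steps=1000, lr=1.0, tol=1e-10):
    """gradient descent with a simple adaptive step size:
    halve the step until f decreases, grow it slightly after a success."""
    for k in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        while f(x - lr * g) >= f(x):   # shrink until we make progress
            lr *= 0.5
        x = x - lr * g
        lr *= 1.1                       # cautiously grow again
    return x, steps

def newton(x, steps=100, tol=1e-10):
    """undamped Newton steps; on a quadratic this converges in one step."""
    for k in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        x = x - np.linalg.solve(hess(x), g)
    return x, steps

x0 = np.array([5.0, 5.0])
print("gradient descent (x, steps):", gradient_descent(x0.copy()))
print("newton           (x, steps):", newton(x0.copy()))
```

the gradient descent run needs many cheap steps, while Newton lands on the optimum after a single (more expensive) step, which is the behaviour the solution note describes.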

assignment part: choose a more realistic landscape

optimization landscapes found in learning are generally not nicely quadratic. this time, let the student implement the whole function body. use a function that does not behave as nicely under Newton's method; ideally, it should not require a more sophisticated adaptive step size scheme for gradient descent. also explain why, on this landscape, it is necessary to use a step size with Newton's method. let the student observe what happens.

solution: Newton's method performs a lot worse in comparison on non-quadratic landscapes and needs to be controlled via a step size.
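one possible landscape for this part (an assumption, the assignment may pick a different function) is f(x) = sqrt(1 + x^2): the full Newton update there is x_new = -x^3, which diverges for |x0| > 1, while plain gradient descent with a constant step stays stable. a minimal sketch:

```python
import numpy as np

# non-quadratic 1D landscape (illustrative choice): f flattens out away from
# the minimum, so a full Newton step overshoots while gradient descent is fine
def f(x):
    return np.sqrt(1.0 + x**2)

def grad(x):
    return x / np.sqrt(1.0 + x**2)

def hess(x):
    return (1.0 + x**2) ** -1.5

def gradient_descent(x, lr=0.5, steps=500, tol=1e-10):
    for k in range(steps):
        g = grad(x)
        if abs(g) < tol:
            return x, k, "converged"
        x -= lr * g
    return x, steps, "max steps"

def newton(x, step_size=1.0, steps=500, tol=1e-10):
    """with step_size=1.0 the full update is x_new = -x**3, which diverges
    for |x0| > 1; a step size < 1 restores convergence on this landscape."""
    for k in range(steps):
        if abs(x) > 1e6:
            return x, k, "diverged"
        g = grad(x)
        if abs(g) < tol:
            return x, k, "converged"
        x -= step_size * g / hess(x)
    return x, steps, "max steps"

print("gradient descent:", gradient_descent(2.0))
print("newton, step 1.0:", newton(2.0))
print("newton, step 0.1:", newton(2.0, step_size=0.1))
```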

(done partly) assignment part: deep learning

now that we have seen gradient descent, we can implement a rudimentary deep learning algorithm. let the student implement the interesting parts of a skeleton using their optimization knowledge from the previous exercises (see the function deep_learning in solution.py).
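a minimal sketch of the kind of loop such a skeleton could build up to, written in plain numpy with hand-derived gradients (in the assignment the autodiff framework would supply them). the network size, the toy data, and the name deep_learning_sketch are illustrative assumptions, not the deep_learning function from solution.py.

```python
import numpy as np

def deep_learning_sketch(x, y, hidden=16, lr=0.1, epochs=500, seed=0):
    """train a one-hidden-layer tanh network with plain gradient descent;
    the gradients are written out by hand here instead of via autodiff."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 0.5, (hidden, x.shape[1]))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.5, (1, hidden))
    b2 = np.zeros(1)
    n = x.shape[0]
    for _ in range(epochs):
        # forward pass
        h = np.tanh(x @ w1.T + b1)      # (n, hidden)
        pred = h @ w2.T + b2            # (n, 1)
        err = pred - y                  # mean squared error residual
        # backward pass (chain rule written out manually)
        d_pred = 2 * err / n
        d_w2 = d_pred.T @ h
        d_b2 = d_pred.sum(axis=0)
        d_h = d_pred @ w2
        d_pre = d_h * (1 - h**2)        # derivative of tanh
        d_w1 = d_pre.T @ x
        d_b1 = d_pre.sum(axis=0)
        # gradient descent step
        w1 -= lr * d_w1; b1 -= lr * d_b1
        w2 -= lr * d_w2; b2 -= lr * d_b2
    return lambda xq: np.tanh(xq @ w1.T + b1) @ w2.T + b2

# toy regression problem
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, (200, 1))
y = np.sin(x) + 0.05 * rng.normal(size=(200, 1))
model = deep_learning_sketch(x, y)
print("final mse:", float(np.mean((model(x) - y) ** 2)))
```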

assignment part: autodiff optimization

work towards an assignment part where the student improves or augments some part of the autodiff framework or user code. this could be implementing an operator and making it run faster, or hinting const-w.r.t.-type information to the framework; for example, data tensors are always constant w.r.t. the optimized weights. further, most gradient calculations can be sped up if they consist of larger operators: a dedicated operator for multiplying a constant matrix with a vector has a known closed-form derivative w.r.t. the vector, whereas the standard implementation without any assumptions should be a lot slower. it may be educational to find an example where this type of "trick" is used, so that students get a feeling for the kind of optimization done in autodiff libraries.
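a sketch of the kind of speed-up meant here, outside of any particular framework: the backward pass of a dedicated constant-matrix-times-vector operator is a single closed-form matrix multiply, while differentiating the same product built from many small elementwise nodes touches every entry separately. all names below are made up for illustration.

```python
import numpy as np
import time

# fused operator: y = A @ v with A constant w.r.t. the optimized vector v.
# its Jacobian dy/dv is known in closed form (it is simply A), so the
# vector-Jacobian product in the backward pass is one matrix multiply.
def matvec_const_backward(A, upstream_grad):
    return A.T @ upstream_grad

# "generic" route: treat A @ v as many elementwise multiplies followed by
# sums and accumulate the derivative of every tiny node separately,
# roughly what a naive autodiff graph without assumptions would do.
def matvec_generic_backward(A, upstream_grad):
    n, m = A.shape
    grad_v = np.zeros(m)
    for i in range(n):
        for j in range(m):
            # d(A[i, j] * v[j]) / dv[j] = A[i, j], accumulated entry by entry
            grad_v[j] += A[i, j] * upstream_grad[i]
    return grad_v

A = np.random.rand(300, 300)
g = np.random.rand(300)
assert np.allclose(matvec_const_backward(A, g), matvec_generic_backward(A, g))

t0 = time.perf_counter(); matvec_const_backward(A, g)
t1 = time.perf_counter(); matvec_generic_backward(A, g)
t2 = time.perf_counter()
print(f"fused closed form: {t1 - t0:.5f}s   generic per-entry: {t2 - t1:.5f}s")
```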

(optional) optimize diff calculation using caches

to achieve the asymptotic performance of backpropagation in deep learning, it is crucial to cache dy/dx differentials keyed by (y, x) so that intermediate results in the chain rule can be reused.
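a toy sketch of the caching idea on scalar expression nodes (the Node class and the dict keyed by (id(y), id(x)) are illustrative, not the framework's actual design): without the cache, the shared subexpression below would be differentiated once per reference.

```python
class Node:
    """tiny scalar expression node: value plus local partials w.r.t. its inputs."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs            # list of (input_node, d self / d input)

def add(a, b): return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])
def mul(a, b): return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def diff(y, x, cache=None):
    """dy/dx via the chain rule; results are cached by (y, x) so that shared
    subexpressions are differentiated only once (backprop-like asymptotics)."""
    if cache is None:
        cache = {}
    key = (id(y), id(x))
    if key not in cache:
        if y is x:
            cache[key] = 1.0
        else:
            cache[key] = sum(local * diff(u, x, cache) for u, local in y.inputs)
    return cache[key]

# a graph with a shared subexpression s, reused by both branches of y
x = Node(3.0)
s = mul(x, x)           # s = x^2, shared
y = add(mul(s, s), s)   # y = s^2 + s = x^4 + x^2
print(diff(y, x))       # 4*x^3 + 2*x = 114 at x = 3
```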

(optional) assignment part: vary constant step size

let the student explore what happens if the step size / learning rate is decreased or increased. this is a trade-off between oscillation and stagnation, ideally handled automatically using a learning rate scheduler or adaptive step size. relate this to the integration chapter.
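a tiny illustration of the trade-off (an illustrative assumption, not assignment code): on f(x) = x^2 the constant-step update is x ← (1 - 2·lr)·x, so the factor's magnitude directly shows oscillation, divergence, or stagnation.

```python
def run_gd(lr, x0=5.0, steps=50):
    """constant-step gradient descent on f(x) = x^2 (gradient 2x)."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

for lr in [1.1, 0.9, 0.5, 0.01, 0.0001]:
    print(f"lr={lr:<7} x after 50 steps: {run_gd(lr):+.6g}")
# lr = 1.1 flips sign every step and diverges, lr = 0.9 oscillates but still
# converges, lr = 0.5 jumps straight to the minimum, and the very small
# rates barely move away from x0 = 5 (stagnation)
```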

(optional) assignment part: vary the batch size

let the student explore what happens if the batch size is decreased or increased. this is a trade-off between slower and more accurate gradient step computation. too small a batch with too high a step size can cause oscillation near the optimum. the batch size is usually not varied during the optimization, since its optimal value depends very much on the hardware the code runs on and is found by trial and error.
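a small sketch of the accuracy side of the trade-off (illustrative assumption, not assignment code): the mini-batch gradient of a linear least-squares loss scatters less around the full-batch gradient as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = 3.0 * x + rng.normal(0, 0.3, x.size)    # data from a known linear model
w = 1.0                                      # current weight estimate

def batch_gradient(batch_size):
    """gradient d/dw of mean((w*x - y)^2) estimated on a random mini-batch."""
    idx = rng.choice(x.size, batch_size, replace=False)
    return np.mean(2 * (w * x[idx] - y[idx]) * x[idx])

full = np.mean(2 * (w * x - y) * x)
for batch_size in [4, 32, 256, 2048]:
    estimates = [batch_gradient(batch_size) for _ in range(200)]
    print(f"batch {batch_size:>5}: mean {np.mean(estimates):+.3f}  "
          f"std {np.std(estimates):.3f}   (full gradient {full:+.3f})")
# larger batches cost more per step, but their gradient estimate scatters less
```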

(optional) assignment part: relative mean absolute error

is it possible to achieve a negative RMAE, i.e. beat the (so-called) optimal error? initialize the weights with the true weights and observe what happens. solution: yes, the optimizer may shift the weights slightly towards the samples (thus overfit), even though the architecture is well chosen, i.e. not powerful enough to interpolate the samples.

(optional) assignment part: vary the sample size

if the sample size is decreased significantly while batch size and step size are kept constant, does the loss increase? why? is this intuitive? what would be an alternative way to measure/compare the final performance? the idea of this section is to think "outside of the metric", i.e. to question the error/loss as an evaluation metric. as an open-ended question, students may come up with the idea of an independent test set, although this is already statistics/data science and may be out of scope.
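a sketch of the evaluation pattern students might arrive at, with a polynomial fit standing in for the trained network (the data, the split, and the polyfit degree are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = np.sin(x) + 0.05 * rng.normal(size=300)

# hold out an independent test set: evaluate on data the fit never saw,
# instead of trusting the training loss alone
split = int(0.8 * x.size)
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

coeffs = np.polyfit(x_train, y_train, deg=5)   # stand-in for the trained network
def mse(xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

print("train mse:", mse(x_train, y_train))
print("test  mse:", mse(x_test, y_test))
```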

(optional) theory: further optimization

since this is a toy implementation, mention other types of performance optimizations which are actually done or imaginable in autodiff frameworks but not implemented here, e.g. optimizing the computation graph or GPU acceleration.

porting to c++

look for a C++ library that implements an n-dimensional array with analogous functionality, e.g. xtensor.

numpy functions

  1. np.random.rand – Generate random numbers from a uniform distribution.
  2. np.ndarray – Create uninitialized multidimensional arrays.
  3. np.sum – Compute the sum of array elements.
  4. np.prod – Compute the product of array elements or dimensions.
  5. np.array – Create arrays from Python sequences.
  6. np.log – Compute the natural logarithm element-wise.
  7. np.tanh – Compute the hyperbolic tangent element-wise.
  8. np.sqrt – Compute the square root element-wise.
  9. np.abs – Compute the absolute value element-wise.
  10. np.vectorize – Vectorize Python functions for element-wise operations.
  11. np.eye – Create 2D identity matrices.
  12. np.empty – Create uninitialized arrays.
  13. np.ndindex – Generate multi-dimensional indices for array iteration.
  14. np.reshape – Reshape arrays to specified dimensions.
  15. np.zeros – Create arrays filled with zeros.

np.ndarray Attributes and Methods Used

Attributes

  1. np.ndarray.shape – Retrieve the shape (dimensions) of the array.
  2. np.ndarray.dtype – Retrieve the data type of the array.

Methods

  1. Indexing
    • np.ndarray.__getitem__ – Access elements or slices of the array.
    • np.ndarray.__setitem__ – Assign values to elements or slices.
  2. Iterators
    • np.ndarray.flat – A flat iterator over all elements of the array.
  3. Arithmetic Operators
    • np.ndarray.__add__ – Element-wise addition: array1 + array2.
    • np.ndarray.__sub__ – Element-wise subtraction: array1 - array2.
    • np.ndarray.__mul__ – Element-wise multiplication: array1 * array2.
    • np.ndarray.__truediv__ – Element-wise division: array1 / array2.
  4. Representation
    • np.ndarray.__repr__ – String representation for debugging.
    • np.ndarray.__str__ – String representation for display.

(optional) memory management by reference counting

many tensors are intermediate results captured implicitly in differentiation rule lambdas. these dependencies are always acyclic, since differentiation rules are acyclic, so tensor memory can be managed by reference-counted shared pointers.
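the C++ port would presumably use std::shared_ptr for this; the same idea can be shown in Python terms, since CPython itself reference-counts. the sketch below (names made up for illustration) shows an intermediate tensor kept alive only by a backward-rule lambda and freed as soon as that lambda is dropped:

```python
import weakref
import numpy as np

def mul_with_backward(a, b):
    """forward product plus a backward rule; the rule's lambda implicitly
    captures the inputs it needs (here: a and b) as intermediate results."""
    out = a * b
    backward = lambda upstream: (upstream * b, upstream * a)  # captures a, b
    return out, backward

a = np.random.rand(3)
b = np.random.rand(3)
probe = weakref.ref(a)            # watch the lifetime of the intermediate

out, backward = mul_with_backward(a, b)
del a, b                          # user code drops its own references
print(probe() is not None)        # True: the backward lambda keeps `a` alive

del backward                      # once the (acyclic) backward chain is gone...
print(probe() is None)            # True: reference counting freed the tensor
```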
