Vector Databases: Unlock the Potential of Your Data
https://www.edge-ai-vision.com/2023/10/vector-databases-unlock-the-potential-of-your-data/
Mon, 09 Oct 2023 22:01:16 +0000

This blog post was originally published at Tenyks’ website. It is reprinted here with the permission of Tenyks.

In the field of artificial intelligence, vector databases are an emerging database technology that is transforming how we represent and analyze data by using vectors — multi-dimensional numerical arrays — to capture the semantic relationships between data points.

In this article, we begin by defining what a vector database is. We then compare some of the top companies offering vector database solutions and highlight how vector databases differ from relational, NoSQL, and graph databases. Next, we illustrate with an example how vector databases work in action. Finally, we discuss what might be on the horizon for this technology.

What is a Vector DB

In essence, a vector database is a special-purpose database for storing and managing embedding vectors. It’s optimized for fast similarity search and relationship detection in applications such as image search, recommender systems, text understanding, and many more.

Machine learning has enabled the transformation of unstructured data into vector representations that capture meaningful relationships within the data. These vector representations, called embeddings, are used for data analysis and power many machine learning applications.

For instance, [10] highlights how recommender systems commonly use vector embedding techniques like item2vec [1], word2vec [2], doc2vec [3] and graph2vec [4] to convert items into vectors of numeric features. Recommendations are then generated by identifying the items with the most similar vector representations. Images [5] and natural language also have inherent vector-based representations due to their numeric pixel and word components.

Vector databases originate from vector similarity search, where early systems [6, 7] were capable of similarity queries but lacked performance at scale with dynamic vector data. The first solutions for similarity search were either algorithms (i.e. libraries) [8] or systems [9]. The former (e.g. FAISS from Facebook) handle large volumes of data poorly, assuming all data and indexes fit into main memory. The latter (e.g. Alibaba AnalyticDB-V) are not a good fit for vector data and do not really focus on vectors as first-class data types.

Given these issues, purpose-built vector database solutions emerged, such as Milvus [10]. Milvus is a vector data management system built on top of FAISS that overcomes previous solutions’ shortcomings. It is designed specifically for large-scale vector data and treats vectors as a native data type.

Unlike a traditional relational database (e.g. MySQL), a vector database represents information as vectors — geometric objects that encode the relationship between data points.

Microsoft defines a Vector DB as follows:

A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. Each vector has a certain number of dimensions, which can range from tens to thousands, depending on the complexity and granularity of the data.

Why are relational databases not enough? Relational databases are ill-suited for modern machine learning applications that require fast, complex pattern analysis across large datasets. While relational databases excel at table-based storage and querying, their tabular data model cannot capture the semantic relationships between data points required for ML.

To have a complete picture of a vector database, it’s helpful to define what a vector embedding and an embedding model are.

Vector embedding

Vector embeddings are the representations of data stored and analyzed in vector databases. These vectors place semantically similar items close together in space, and dissimilar items far apart.

These (vector) embeddings can be produced for any kind of information — words, phrases, sentences, images, nodes in a network, etc. Once you have vector embeddings for your data, algorithms can detect patterns, group similar items, find logical relationships, and make predictions.


Vector embedding example using Star Wars characters

The previous figure shows an embedding representation of Star Wars characters, learned from analyzing patterns in dozens of Star Wars books. This embedding space could be used as follows:

  • Cluster characters into groups like “Jedi”, “Sith”, “Droids”, etc. based on vector proximity.
  • For a character like Yoda, the nearest neighbors in the vector space may be other Jedi masters (e.g., Luke), indicating an affiliation we could infer even with no label for the given cluster (see the sketch after this list).
  • Find edge cases, e.g. Anakin Skywalker can sit at the intersection of Jedi & Sith, even though we know his final form is more akin to Sith & Droid once he is fully led to the dark side.
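To make the nearest-neighbor idea concrete, here is a tiny sketch (in Python with NumPy) of the lookup a vector database performs at a much larger scale. The 4-dimensional vectors below are made up for illustration; a real embedding model would produce hundreds of dimensions.

import numpy as np

# Toy 4-dimensional embeddings; real embeddings come from a model such as CLIP or word2vec.
embeddings = {
    "Yoda":  np.array([0.9, 0.1, 0.0, 0.2]),
    "Luke":  np.array([0.8, 0.2, 0.1, 0.3]),
    "Vader": np.array([0.2, 0.9, 0.1, 0.4]),
    "R2-D2": np.array([0.1, 0.1, 0.9, 0.0]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbors(query_name, k=2):
    # Score every other character against the query and return the top-k matches.
    scores = {
        name: cosine_similarity(embeddings[query_name], vec)
        for name, vec in embeddings.items()
        if name != query_name
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(nearest_neighbors("Yoda"))  # Luke ranks first: the two vectors point in similar directions

A vector database does essentially this, but over millions of vectors, using approximate indexes so the search stays fast.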

Different embeddings will compute different underlying similarity measures, see the following figure. For example, CLIP can compute the high-level semantic similarity of concepts like “Jedi” and “Sith”, whereas other embeddings, such as PCA, may compute lower-level similarities, such as shapes or colours.


A different vector embedding space of the same Star Wars characters

Embedding model

Vector databases use embedding models as a key component for translating data into vector formats optimized for similarity search and pattern analysis. The embedding models produce the vector representations that vector databases are built to store, query and analyze.

Some ways embedding models work with vector databases include:

  • Vector databases rely on embedding models to encode data such as words, images, knowledge graphs, etc. into numeric vector representations.
  • Because embedding models map semantically related items close together in vector space, vector databases can perform rapid vector similarity searches.
  • Embedding models map sparse data into lower-dimensional dense vectors, which vector databases are optimized to work with.

Vector embeddings, embedding models and vector databases work together to provide an end-to-end solution for generating, storing, and using vector data to power AI applications.
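As a rough illustration of that end-to-end flow, the sketch below uses FAISS (the similarity-search library mentioned earlier) as a stand-in for the index layer of a vector database. The random vectors stand in for embeddings an actual model would produce; a full vector database adds persistent storage, metadata filtering, and scaling on top of an index like this.

import numpy as np
import faiss  # pip install faiss-cpu

dim = 512                      # dimensionality of the embedding model's output (assumed)
rng = np.random.default_rng(0)

# Stand-ins for embeddings produced by a model over 10,000 items.
item_embeddings = rng.standard_normal((10_000, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)   # exact L2 search; large deployments typically use ANN indexes
index.add(item_embeddings)       # "ingest" step: store the vectors in the index

# "Query" step: embed the query with the same model, then fetch the 5 closest items.
query = rng.standard_normal((1, dim)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0], distances[0])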

Top Vector DB technology providers


Top Vector database providers available in the market

Weaviate is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML models, and scale seamlessly into billions of data objects.

Elastic is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.

Milvus is a vector database created in 2019 with a singular goal: store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models.

Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors.

Pinecone is a vector database that makes it easy to build high-performance vector search applications. Developer-friendly, fully managed, and easily scalable without infrastructure hassles.

Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine.

How Vector DBs compare to other kinds of DBs

Vector databases excel in their particular niche: handling embedding vectors at scale. The following table shows some of the differences between Vector DBs and other types of databases.


Comparing Vector databases with other kinds of databases

Bear in mind that while this table provides a general overview, there can be specific databases within each category that have unique features and characteristics.

A practical showcase on Vector DB

At Tenyks we rely on vector databases to store millions of embedding entries in our system. As we help companies identify edge cases and outliers, we depend on vector embeddings to represent their data for these use cases.

Vector databases are a perfect complement to state-of-the-art models like CLIP that produce rich, information-dense vector embeddings. These embeddings frequently have hundreds of dimensions to capture complex relationships, but vector databases can search and analyze them with ease.

The Tenyks platform performs lightning-fast semantic searches across enormous volumes of vector data. This powers capabilities such as rapid embedding search for image/text similarity.

Here’s (video download link) an example of a use case of vector databases in action. Using the BDD dataset, a driving dataset, we are interested in finding images of white cars. The snippet shows how the Tenyks platform allows you to find similar images given a text input. In this case, after entering the text “white car” in the search input bar, our similarity feature outputs images from this dataset that contain white cars.
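Under the hood, a search like this boils down to embedding both modalities into the same space and ranking by similarity. The sketch below illustrates the idea with an off-the-shelf CLIP checkpoint from Hugging Face; it is not the Tenyks implementation, and the image paths are placeholders. In production, the image embeddings would be precomputed and stored in the vector database, so only the text query is embedded at search time.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(query):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

image_paths = ["bdd/0001.jpg", "bdd/0002.jpg"]      # placeholder dataset paths
image_feats = embed_images(image_paths)
text_feat = embed_text("white car")

scores = (image_feats @ text_feat.T).squeeze(1)     # cosine similarity (vectors are normalized)
ranking = scores.argsort(descending=True)
print([image_paths[int(i)] for i in ranking])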

Future outlook

Vector databases are likely to become commodities as demand grows for managing machine learning vector data at scale. They provide the performance, scale, and flexibility that AI applications require across industries.

Unlike other databases, vector databases were created specifically for vector embeddings and neural network applications. They introduce a vector-native data model and query language providing functionality beyond SQL or graphs. As machine learning enriches use cases that understand the world through vectors, vector databases deliver the data solution to gain insights from them.

Vector databases exhibit characteristics of both commodities and novel technologies. They are becoming commonplace for enterprises developing AI but represent a new database with a vector-first architecture no other technology provides.

References

  1. Item2Vec: Neural Item Embedding for Collaborative Filtering
  2. Efficient Estimation of Word Representations in Vector Space
  3. Distributed Representations of Sentences and Documents
  4. graph2vec: Learning Distributed Representations of Graphs
  5. Efficient Indexing of Billion-Scale datasets of deep descriptors
  6. SPTAG: A library for fast approximate nearest neighbor search
  7. Db2 event store: a purpose-built IoT database engine
  8. Billion-scale similarity search with GPUs
  9. AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data
  10. Milvus: A Purpose-Built Vector Data Management System

Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan

The Guide to Fine-tuning Stable Diffusion with Your Own Images
https://www.edge-ai-vision.com/2023/10/the-guide-to-fine-tuning-stable-diffusion-with-your-own-images/
Mon, 09 Oct 2023 18:49:35 +0000

This article was originally published at Tryolabs’ website. It is reprinted here with the permission of Tryolabs.

Have you ever wished you were able to try out a new hairstyle before finally committing to it? How about fulfilling your childhood dream of being a superhero? Maybe having your own digital Funko Pop to use as your profile picture? All of these are possible with DreamBooth, a new tool developed by researchers at Google that takes recent progress in text-conditional image synthesis to the next level.

In our previous post, we discussed text-to-image generation models and the massive impact that models like DALL·E and Stable Diffusion are having throughout the Machine Learning community.

Now, in this blog post, we will guide you through implementing DreamBooth so that you can generate images like the ones you see below. To do so, we’ll implant ourselves into a pre-trained Stable Diffusion model’s vocabulary. Be warned, generating images of yourself (or your friends) is highly addictive. Don’t say we didn’t warn you!

Also, if you know part of our team, you may recognize some faces in the following images.

DreamBooth motivation

Feel free to skip this section if you’re not particularly interested in the theory behind the approach and prefer to dive straight into the implementation.

The first step towards creating images of ourselves using DreamBooth is to teach the model how we look. To do so, we’ll follow a special procedure to implant ourselves into the output space of an already trained image synthesis model.

You may be wondering why we need to follow such a special procedure. After all, these new generation image synthesis models have unprecedented expressive power. Can’t we just feed the model an extremely detailed description of the person and be done with it? The short answer is no. It’s still very hard for these models to reconstruct the key visual features that characterize a specific person. Instead, the model must learn what we look like down to the last detail so that it can later reproduce us in the most fictional scenarios.

To achieve this, we’ll fine-tune this model with a set of images, binding them to a unique identifier that references us.

But wait a minute… How many of these images will we need? Deep Learning models usually require large amounts of data to produce meaningful results (even more so these large image synthesis models). Does this mean that we need thousands of pictures of ourselves for the model to reproduce us faithfully?

Fortunately, the answer is no. The technique we’re about to show you achieves results like you have seen above with no more than a dozen images of your face. Still, these images must exhibit some variation in terms of different perspectives of your face (e.g., front, profile, angles in between), facial expressions (e.g., neutral, smiling, frowning), and backgrounds. Here are examples from the three victims we chose for this blog post: Fernando, Giuls, and Luna (from left to right).

Once you’ve collected these images, the next step is to label them with a text prompt. Following the instructions in DreamBooth’s paper, we’ll use the prompt A [token name] [class noun] where [token name] is an identifier that will reference us, and [class noun] is an already existing class in the model’s vocabulary which describes us at a high level. For instance, for Fernando Bernuy (co-writer and one of the victims of our experiment), a possible prompt would be A fbernuy man. Other examples of class nouns include woman, child, teenager, dog, or sunglasses. Yes, this approach works with animals and other objects too!

The motivation behind linking our unique identifier with a class noun during training is to leverage the model’s strong visual prior of the subject’s class. In other words, it will be much easier for the model to learn what we look like if we tell it that we are a person and not a refrigerator. The authors of DreamBooth found that including a relevant class noun in the training prompts reduced training time and increased the visual fidelity of the subject’s reproduced features.

However, there are still two issues we must address before we can fine-tune the model:

The first one is overfitting: these extremely large generative models will inevitably overfit such a small set of images, no matter how varied it may be. This means that the model will learn to reproduce the subject with high fidelity, but mostly in the poses and contexts present in the training images.


Prior-preservation loss acts as a regularizer that alleviates overfitting, allowing pose variability and appearance diversity in a given context. Image and caption from DreamBooth’s paper.

The second is language drift: since the training prompts contain an existing class noun, the model forgets how to generate different instances of the class in question. Instead, when prompted for a [class noun], the model returns images resembling the subject on which it was fine-tuned. Essentially, it replaces the visual prior it had for the class with the specific subject that we introduced into its output space. And although Fernando is a handsome man, not all men look like him!


Language drift. Without prior-preservation loss, the fine-tuned model cannot generate dogs other than the fine-tuned one. Image taken from DreamBooth’s paper.

To solve both issues, the authors of DreamBooth propose a class-specific prior-preservation loss. Simply put, the idea is to supervise the fine-tuning process with the model’s own generated samples of the class noun. In practice, this means having the model fit our images and the images sampled from the visual prior of the non-fine-tuned class simultaneously. These prior-preserving images are sampled and labeled using the [class noun] prompt. This helps the model remember what a generic member of the subject class looks like. The authors recommend sampling a number of 200×N [class noun] images, where N stands for the number of images of the subject.
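In code, the per-step objective is simply two denoising (reconstruction) terms added together. Here is a minimal sketch of that combination; the tensors would be the model’s noise predictions and targets for the subject batch and for the prior-preservation batch, and prior_loss_weight plays the role of the weighting factor described in the paper (the names are illustrative, not taken from any specific training script).

import torch.nn.functional as F

def dreambooth_loss(instance_pred, instance_target, prior_pred, prior_target, prior_loss_weight=1.0):
    # Fit the subject's images...
    instance_loss = F.mse_loss(instance_pred, instance_target)
    # ...while staying close to the class prior generated by the original model.
    prior_loss = F.mse_loss(prior_pred, prior_target)
    return instance_loss + prior_loss_weight * prior_loss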


Training approach. The subject’s images are fitted alongside images from the subject’s class, which are first generated using the same Stable Diffusion model. The super resolution component of the model (which upsamples the output images from 64 x 64 up to 1024 x 1024) is also fine-tuned, using the subject’s images exclusively. Image taken from DreamBooth’s paper.

Now that we’ve covered all the relevant pieces of the theory, all that’s left is to fine-tune the image synthesis model. Let’s do it!

Fine-tuning Stable Diffusion with your photos

Three important elements are needed before fine-tuning our model: hardware, photos, and the pre-trained stable diffusion model.

The original implementation requires a large amount of GPU resources to train, making it difficult for common Machine Learning practitioners to reproduce. However, a community on Discord has developed an unofficial implementation that requires fewer computing resources. If you happen to have access to a machine with at least a 16GB VRAM GPU, you can easily train your model following Hugging Face’s DreamBooth training example instructions. If you don’t, we’ve got you covered! In this post, we’ll show you how to train and run inference in a free-tier Google Colab. Yes, you’ve read that right, a free-tier Google Colab!

Note that the notebook used may be outdated due to the rapid advancements in the libraries used, but it has been tested and confirmed to still be functional as of January 2022.

The second element is the subject’s photos. In this tutorial, we’re going to use pictures of members of the TryoGang and one of our pets. In any case, there are some rules we need to follow to get the best possible results.

As mentioned in the motivation section, Stable Diffusion tends to overfit the training images. To prevent this, make sure that the training subset contains the subject in different poses and locations. Even though the original paper recommends using 4 to 6 images, the community on Discord has found that using 10 to 12 images leads to better results. As a rule of thumb, we’ll use 2 images that include the torso and 10 of the face, with different backgrounds, styles, expressions, looking and not looking at the camera, etc.

If you’re looking at the camera and smiling in every photo, don’t expect the model to generate you looking sideways or with a neutral face, so avoid using selfies only!

In addition, make sure to crop the training images to a square ratio since Stable Diffusion scales them down to 64 x 64 to use them for training.
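A quick way to do that cropping in bulk is sketched below with Pillow. The folder names and the 512-pixel output size are assumptions; use whatever resolution your training setup expects.

from pathlib import Path
from PIL import Image

def center_crop_square(path, out_dir="training_images", size=512):
    # Center-crop to a square, then resize to the training resolution.
    img = Image.open(path).convert("RGB")
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((size, size))
    out = Path(out_dir) / Path(path).name
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)
    return out

for p in Path("raw_photos").glob("*.jpg"):   # hypothetical input folder
    center_crop_square(p)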

And last but not least, we’ll need the pre-trained Stable Diffusion model’s weights. These can be downloaded from Hugging Face, for which we’ll need to create an account, read the model card and accept the terms and conditions. Don’t download the model manually because the training script will do it automatically.

Now that we’ve got everything set up, let’s fine-tune the model!

Training

We will use this implementation that includes a notebook ready to use in Google Colab. You can open the notebook by clicking on this link.

Before running it, let’s modify it for our use case (we’ll use Fernando as the subject to illustrate the instructions). We need to define four parameters for the training process:

  1. TOKEN NAME: corresponds to the unique identifier which will reference the subject we want to add. This name should be unique, so we don’t have to compete with an existing representation. Here we can use a simple first initial + last name token name, such as fbernuy.
  2. CLASS NAME: This is the class name we introduced in the motivation section. The original DreamBooth paper recommends using generic classes such as man, woman, or child (if the subject is a person) or cat or dog (if the subject is a pet). However, the Discord community implementing the approach on Stable Diffusion has found that using celebrities who are similar to the subject produces better results. In our case, we used George Clooney when the subject is a man and Jennifer Aniston when it’s a woman. We still used the “cat” class for Luna, as we couldn’t think of a suitable famous cat other than Garfield.
  3. NUMBER OF REGULARIZATION IMAGES: As mentioned in the motivation section, we need the class-specific prior-preservation loss to prevent overfitting and language drift issues. We followed the original authors’ recommendation of using 200 images per training image. Remember that using more regularization images may lead to better results.
  4. TRAINING ITERATIONS: This parameter defines the number of iterations the model will run during the fine-tuning process. If this number is too low, the model will underfit the subject’s images and won’t be able to reproduce it accurately during inference. If it’s too high, the model will overfit instead, making it unable to reproduce the subject with expressions, poses, or contexts outside of those in the training subset. A rule of thumb that has shown good results in our experiments is to use between 100 and 200 iterations per training image. Since we have 12 images of Fernando, let’s use 2400 iterations.

Now let’s modify the notebook with these parameters as follows:

  • Settings and run: we’ll modify the CLASS_NAME to georgeclooney. Also, we’ll replace the default sks token name with fbernuy in the INSTANCE_DIR and OUTPUT_DIR. This will make it easier to identify the directory in which the model and the data will be saved.
  • Start Training:

# replace the instance_prompt parameter with our token name:
--instance_prompt="photo of fbernuy george clooney"
# check that the class_prompt is set as:
--class_prompt="photo of {CLASS_NAME}"
# set:
--num_class_images=200
--max_train_steps=2400
--gradient_accumulation_steps=2
--learning_rate=1e-6

Now we are ready to run the notebook and fine-tune our model. The first few cells will install the required dependencies. After this, we’ll be prompted to log in to HuggingFace using our access token.

Then, we’ll be asked to upload the subject’s photos. Here, we can use the Choose Files button and select the images from our computer or upload them directly to the subject’s directory inside the data folder in the Colab instance. The next cell is where the magic happens. We finally get to fine-tune the model! The script will download the pretrained model’s weights, generate the regularization images, and then execute the specified number of training iterations. The entire process should take about an hour and a half, so be patient. Remember to keep an eye on the notebook!

Once training is over, we’ll be prompted to convert the model to a ckpt file. This is highly recommended since it’s a requirement for an extremely useful web interface that we’ll introduce further down in this blog post. Once we’ve saved the ckpt file in the notebook instance, we’ll download it to our local machine or save it to our drive folder.

We can test our fine-tuned model by running the cells below the “Inference” section of the notebook. The first cell loads the model we just trained and creates a new Stable Diffusion pipeline from which to sample images. We can set a seed to control random effects in the second cell. And now, the moment you’ve been anticipating since you started reading this blog post: generating our custom images!

The cell titled “Run for generating images” controls the image-generating process. There are a total of 7 parameters that we can modify to customize our image (a short code sketch using them follows this list):

  • prompt: the text prompt that will guide the image’s generation. Here’s where we should include the token name that references our subject.
  • negative_prompt: serves to specify what we don’t want to see in the image. For instance, if we want to generate an image with a cloudy sky, we enter clear sky as the negative prompt.
  • num_samples: the number of images the model will generate in a single batch.
  • guidance_scale: also known as CFG Scale, is a float that controls how much importance is given to the input text prompt. Lower values of this parameter will allow the model to take more artistic liberties when generating the images.
  • num_inference_steps: the number of denoising steps that the model will run. A higher number of steps will usually lead to more detailed images at the cost of an increased inference time. Be careful with this parameter, though, since too many steps may lead to visual artifacts in the images.
  • height: the height of the generated image in pixels.
  • width: the width of the generated image in pixels.

There’s no magic formula to generate the perfect image, so you’ll probably have to play around with these parameters for a while before achieving the results you want. If you’re having trouble generating cool images, don’t get discouraged! Some of the most common issues have pretty straightforward solutions, according to Joe Penna (one of the managers at the Stable Diffusion Discord channel).

  • If they don’t look like the subject: Check to see if the prompt is right and if the images follow the tips we gave before. Try including the class name in the prompt and the token name (i.e., a photo of TOKEN_NAME georgeclooney). We may also need to train for more iterations.
  • If they look too much like the training images: we might have trained for too long, used too few images, or our images may be too similar. We modify the prompt by including the token name towards the end of it, for instance: an exquisite portrait photograph, 85mm medium format photo of TOKEN_NAME with a classic haircut.
  • If using a complex prompt doesn’t give us the desired results: we might have trained for too few iterations. We can try repeating the token name in the prompt, for instance: TOKEN_NAME in a portrait photograph, TOKEN_NAME in an 85mm medium format photo of TOKEN_NAME.

Although the notebook is extremely useful for training the model, it’s far from being the best platform to generate images. In the following section, we’ll introduce an incredibly powerful tool to enhance the image generation process further.

In practice: generating cool images

Creating great images requires both practice and patience. However, this process can be alleviated by using the right tools. The one we’re about to show you is truly mind-blowing; it’s so versatile that we can’t recommend it enough! It’s a WebUI that makes the entire process more interactive and fun.

To use it, we must run a web server and follow the Install instructions available for Linux, Windows, or Apple Silicon. Alternatively, we can run the server on another Colab using this link. Beware that time flies when generating images, and Colab’s free tier is limited!

Once installed, we’ll copy our model’s ckpt file in the web server folder, stable-diffusion-webui/models/Stable-diffusion, and then run the web server script (webui.sh or webui.bat). This gives us the UI’s address and port so we can open it using our favorite browser.


WebUI tool for Stable Diffusion, from AUTOMATIC1111

The UI has many different features. We highly recommend exploring the project’s wiki. The development of Stable Diffusion and this UI are moving fast, so be aware that this may change!

The first thing we need to do is to select our fine-tuned Stable Diffusion model. At the top of the WebUI page, we’ll find a drop-down menu with all the available ckpt files. If you don’t see yours in the list, verify that you copied the ckpt file to the correct directory.

For this tutorial, we’ll focus on explaining the UI’s three main functionalities: text to image, image to image, and inpainting.

Text to Image (txt2img)

Text to image is the most straightforward way to use our model: write a prompt, set some parameters, and voilà! The model generates an image that matches the prompt according to the chosen parameters.

This might sound easy at first glance. However, we might need to try several parameter combinations before hitting the spot. Based on our experience, these are the steps we recommend following to generate the coolest images:

  • Pick a style from lexica.art and add your subject to its prompt. For instance, let’s see what Fernando would look like with a new haircut: fbernuy. epic haircut. hairstyling photography.
  • Use a random seed until you get something similar to what you have in mind. It might not look exactly like the subject, but we can fix that later.
  • Copy the seed from the image description and use it to generate the same image with different parameters. The best way to do this is to use the X/Y plot script: select a list of steps (10, 15, 20, 30) and a list of CFG Scales (2.0, 2.5, 3.0, 3.5, 4.0). The tool will plot a matrix with one image for each input step and scale combination. We can also use other parameters as the X and Y variables.
  • Then, pick the one you like the most, copy its corresponding parameter values, and remove the script to generate the selected image alone. If you don’t like any of the images, try with different parameters, a different seed, or a different prompt!


Selected random image


Parameters exploration


Final result

Image to Image (img2img)

The second alternative is to generate a new image based on an existing image and a prompt. The model will modify the entire image, so we can apply new styles or make a small retouch.

Let’s start with a txt2img prompt: very very intricate photorealistic photo of a fbernuy funko pop, detailed studio lighting, award - winning crisp details. Following the strategy explained above, we use txt2img and generate an undoubtedly cool-looking Funko Pop. However, we’d like to improve the beard to be closer to our subject and lighten the nose color.

To do this, we’ll click on the Send to img2img button and manually draw the beard style and nose we want using the MS Paint-like tool of the WebUI (center). We can reduce the denoising strength parameter to have a result as similar as possible to the original and experiment with the rest of the usual parameters until we get the result we are looking for (right).

Left: txt2img generated image. Center: simple image modifications. Right: img2img result.

Following the same img2img strategy, we slightly improved Luna’s fur colors in this epic picture and added some smile lines to the anime version of Giuls.


txt2img generated images


img2img improved image

Inpainting

The third alternative allows us to specify a region in the image for our model to fill, maintaining the rest of the image intact (unlike the img2img method, which modifies the entire input image). This can be useful for swapping a face in an existing photo (if the subject is a person) or generating an image of the subject in a different scenario or lighting condition while preserving the background and context. Keep in mind that using this method is a bit more challenging because there are more parameters to explore.

For example, let’s generate an image of Fernando as Ironman. Since the armor has a lot of important details, we’ll use an original image from the movie poster as the source and swap Ironman’s face using the Inpainting tool.

The first thing we’ll do is select the Inpainting tool inside the img2img tab. After uploading our reference image, we’ll select the area around the head with the brush tool and input a photo of fbernuy as the prompt since we don’t want the model to fill this region with anything else but Fernando’s face.

Before generating the image, let’s take a look at the most relevant parameters added in inpaint.

  • Masked content: defines what to fill the masked region with. We can select original (the default) if the original content is similar to what we want to achieve, experiment with fill to help us keep the surrounding information, or latent noise to use noise. Regardless of the option we pick, random noise will be added based on the Denoising strength parameter.
  • Denoising strength: defines the standard deviation of the random noise added to the masked region. The higher this parameter, the lower the similarity with the content in the unmasked portion of the image.
  • Inpaint at full resolution: inpainting resizes the whole image to the specified target resolution by default. With this parameter enabled, only the masked region is resized, and the result is pasted back into the original picture. This helps get better results for small masks as the inpainted region is rendered at a much larger resolution.

For this example, we’ll use original masked content (since the masked region is already a face) with 0.50 denoising strength and enable inpainting at full resolution. Then, we’ll set the seed to -1 (random) and repeat the process we’ve done before: patiently generate images until we get one similar to what we desire. Finally, we’ll fix the seed and use the X/Y plot script to explore different Sampling Steps and CFG Scale combinations.
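The WebUI drives all of this interactively; for completeness, here is a rough equivalent using the diffusers inpainting pipeline. The checkpoint path, input image, and mask are placeholders, and in practice a checkpoint trained for inpainting tends to give better results than a plain text-to-image one.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "path/to/fbernuy-dreambooth",    # placeholder: your fine-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("ironman_poster.png").convert("RGB").resize((512, 512))
mask_image = Image.open("head_mask.png").convert("L").resize((512, 512))   # white = region to repaint

result = pipe(
    prompt="a photo of fbernuy",
    image=init_image,
    mask_image=mask_image,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
result.save("ironman_fbernuy.png")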

Left: original image. Right: intermediate inpaint results.

Pretty awesome, right? At this point, we’ve generated a great image that kept all the details of the original picture but with Fernando’s face instead of Robert Downey Jr.’s. Still, there’s one small detail we want to fix in the beard.

The best way to fix this is by using inpainting again, but using the already inpainted image instead of the original (didn’t see that one coming, did you?). This way, we can instruct the model to modify the region around the beard exclusively and input a more specific prompt, such as a photo of fbernuy with a beard.


Final inpaint result with beard details

We have shown you how to create cool images of yourself, your friends, your pets, or any particular item you want, either starting from just an idea, a sketch, or an existing image!

Now you are ready to generate cool images on your own! Here are some images we generated from our subjects that can be useful for you to get some inspiration. Have fun!

  • Giuls in Game of Thrones
  • Luna with a birthday hat
  • Fernando, oil canvas
  • Fernando’s business portrait
  • Luna with sunglasses
  • Luna with pearl earrings

Final thoughts

Stable Diffusion signified one of the biggest leaps toward democratizing large image synthesis models. Techniques such as DreamBooth (and their community-driven implementations) allow us to reap the benefits of these models even further, with imagination being our only limit. We are extremely excited to know where this new democratic AI paradigm will lead us and the various ways in which the world will benefit from it.

Fernando Bernuy
Lead Machine Learning Engineer, Tryolabs

Guillermo Etchebarne
Lead Machine Learning Engineer, Tryolabs

“MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedded Vision Systems,” a Presentation from the MIPI Alliance
https://www.edge-ai-vision.com/2023/10/mipi-csi-2-image-sensor-interface-standard-features-enable-efficient-embedded-vision-systems-a-presentation-from-the-mipi-alliance/
Mon, 09 Oct 2023 08:00:35 +0000

Haran Thanigasalam, Camera and Imaging Consultant to the MIPI Alliance, presents the “MIPI CSI-2 Image Sensor Interface Standard Features Enable Efficient Embedded Vision Systems” tutorial at the May 2023 Embedded Vision Summit. As computer vision applications continue to evolve rapidly, there’s a growing need for a smarter standardized interface connecting…

Sony Semiconductor Solutions Concludes Pedestrian Safety Challenge, Announces Winners with tinyML Foundation and The City of San José
https://www.edge-ai-vision.com/2023/10/sony-semiconductor-solutions-concludes-pedestrian-safety-challenge-announces-winners-with-tinyml-foundation-and-the-city-of-san-jose/
Fri, 06 Oct 2023 23:22:49 +0000

Sony Reveals Leopard Imaging, NeurOHM, and King Abdullah University of Science and Technology, as winners of Tech for Good competition in support of the city’s Vision Zero initiatives.

SAN JOSÉ, Calif., Oct. 5, 2023 /PRNewswire/ — Today, Sony Semiconductor Solutions America (SSS-A), alongside the tinyML Foundation and The City of San José, announced the final winners for the Pedestrian Safety Challenge Hackathon competition, which began in May as an effort to reduce pedestrian-involved accidents, in connection with the city’s Vision Zero initiatives.

In collaboration, the three groups joined together to encourage teams across the globe to solve for this issue, as pedestrian injuries and fatalities have become more common with issues like distracted driving, distracted walking, illegally crossing roadways, and more.

The hackathon boasted 29 participating teams from across the globe, including the United States, Germany, Lebanon, Nigeria, and Saudi Arabia, as well as teams local to Silicon Valley and the San Francisco Bay Area (SFBA).

Mark Hanson, Vice President and Head of Marketing for System Solution Business Development at SSS-A states, “It was a pleasure to partner with tinyML and the City of San José on the important issue of pedestrian safety, especially as a native resident and with Sony Electronics’ office in the city. The groundbreaking, people-first solutions coming from these teams makes us optimistic, not just in local Vision Zero efforts, but to see these technologies be used to benefit communities around the globe.”

First place was awarded to the Leopard Imaging team, presenting a solution that features SSS’s AITRIOS™ platform and IMX500-enabled hardware, with the NeurOHM team in second place, a team from King Abdullah University of Science and Technology (KAUST) in third place, and a special Edge Impulse award also going to the KAUST team.

Evgeni Gousev, Senior Director at Qualcomm, and Chair of the Board of Directors at tinyML Foundation says, “As a global non-profit organization with a mission to accelerate development and adoption of energy-efficient, sustainable machine learning technologies, we were enthusiastic for this collaboration with the City of San José, Sony, and other partner companies. We were very pleased to see a strong response from the tinyML Community, are grateful to all the teams and participants who have contributed their ideas and proposals for this real-world problem and would like to congratulate the finalists on delivering innovative-yet-practical solutions.”

Hanson continues, “It was very exciting for us that Leopard Imaging entered with an AITRIOS-built solution and won first place in the Hackathon. It shows that vision AI tools, like AITRIOS, can make these Vision Zero and pedestrian safety goals a tangible, low-cost, and scale-based platform to support these initiatives.”

“Through our partnership with Sony and tinyML, brilliant minds from across the world have generated ideas that will ultimately save lives in San José and beyond,” said San José Mayor, Matt Mahan.

To learn more about the Pedestrian Safety Challenge and its winning solutions, please visit the tinyML Foundation website, here.

About Sony Semiconductor Solutions America

Sony Semiconductor Solutions America is part of Sony Semiconductor Solutions Group, today’s global leader in image sensors. We strive to provide advanced imaging technologies that bring greater convenience and joy to people’s lives. In addition, we also work to develop and bring to market new kinds of sensing technologies with the aim of offering various solutions that will take the visual and recognition capabilities of both humans and machines to greater heights. Visit us at: https://www.sony-semicon.co.jp/e/

Corporate slogan “Sense the Wonder”: https://www.sony-semicon.co.jp/e/company/vision

“Introduction to the CSI-2 Image Sensor Interface Standard,” a Presentation from the MIPI Alliance
https://www.edge-ai-vision.com/2023/10/introduction-to-the-csi-2-image-sensor-interface-standard-a-presentation-from-the-mipi-alliance/
Fri, 06 Oct 2023 08:00:02 +0000

Haran Thanigasalam, Camera and Imaging Consultant to the MIPI Alliance, presents the “Introduction to the MIPI CSI-2 Image Sensor Interface Standard” tutorial at the May 2023 Embedded Vision Summit. By taking advantage of select features in standardized interfaces, vision system architects can help reduce processor load, cost and power consumption…

Five Years for Vision Components and MIPI: New MIPI Camera Module for Highest Image Quality
https://www.edge-ai-vision.com/2023/10/five-years-for-vision-components-and-mipi-new-mipi-camera-module-for-highest-image-quality/
Thu, 05 Oct 2023 18:01:27 +0000

Ettlingen, October 5th, 2023 – Five years ago, Vision Components presented the first MIPI cameras for industrial series applications. Today, the manufacturer from Ettlingen in Germany offers more than 20 different image sensors as MIPI modules. Brand new is the VC MIPI IMX585, which offers the best image quality in all lighting conditions with 4K image resolution and high dynamic range. The company also announces that the VC Lib image processing software will soon be freely available to all customers.

For more information: www.mipi-modules.com

VC MIPI IMX585: 4K resolution and highest dynamic range

The VC MIPI IMX585 Camera is based on the Sony Starvis-2 IMX585 image sensor and offers an image resolution of 8.4 megapixels with 4K and HDR support. The sensor has larger pixels than comparable modules, and its 88 dB dynamic range delivers high image quality in all lighting conditions. The camera with MIPI CSI-2 interface is thus ideally suited for AI-based medical applications and other demanding vision tasks. It can be configured as a color or monochrome camera and will be available in quantities towards the end of the year.

Designed for industrial mass production

The VC MIPI camera modules are developed and manufactured by Vision Components in Ettlingen near Karlsruhe, Germany. They offer high quality, robust and industry-optimized design as well as long-term availability. Via the MIPI CSI-2 interface, the MIPI camera modules can be connected to all common processor platforms. Corresponding drivers are provided by Vision Components free of charge.

Comprehensive accessories and individual sensors on request

Vision Components also supplies high-performance cables and accessories perfectly matching the MIPI cameras from a single source. The smart components enable vision OEMs to bring their projects to market faster, easier and more cost-efficiently. The manufacturer is continuously adding more image sensors to its VC MIPI portfolio, including for applications such as SWIR and 3D/ToF. Upon customer request, VC also integrates special sensors into MIPI modules, even those that do not natively support a MIPI interface.

VC Lib now open to all customers

In order to support customers even better in the integration of embedded vision, the VC Lib software library will be freely available to all customers of the company. Until now, it was reserved for customers of VC embedded vision systems. VC Lib includes basic functions for image processing applications as well as more complex functions such as pattern recognition or barcode reading. The applications are highly optimized for ARM processor platforms and enable fast and cost-effective development of end applications.

About Vision Components

Vision Components is a leading manufacturer of embedded vision systems with over 25 years of experience. The product range extends from versatile MIPI camera modules to freely programmable cameras with ARM/Linux and OEM systems for 2D and 3D image processing. The company was founded in 1996 by Michael Engel, inventor of the first industrial-grade intelligent camera. VC operates worldwide, with sales offices in the USA, Japan and Dubai as well as local partners in over 25 countries.

How NVIDIA and e-con Systems are Helping Solve Major Challenges In the Retail Industry
https://www.edge-ai-vision.com/2023/10/how-nvidia-and-e-con-systems-are-helping-solve-major-challenges-in-the-retail-industry/
Thu, 05 Oct 2023 12:17:04 +0000

This blog post was originally published at e-con Systems’ website. It is reprinted here with the permission of e-con Systems.

e-con Systems has proven expertise in integrating our cameras into the NVIDIA platform, including Jetson Xavier NX / Nano / TX2 NX, Jetson AGX Xavier, Jetson AGX Orin, and NVIDIA Jetson Orin NX / NANO. Find out how our cameras are integrated into the NVIDIA platform, their popular use cases, and how they empower you to solve retail challenges.

In the retail industry, there are numerous challenges, including security risks, inventory management, and enhancing the shopping experience. NVIDIA-powered cameras are helping to address these challenges by providing retailers with real-time data and insights. In addition, these cameras are being used to enhance store security, optimize store layout and staffing, etc.

So, by leveraging the power of the NVIDIA platform, retailers can better understand their customers while improving operations and ultimately providing a more satisfying shopping experience.

In this blog, let’s discover more about the role of e-con Systems’ cameras integrated into the NVIDIA platform, how they help solve some major retail challenges, and their most popular use cases.

Read: e-con Systems launches 3D time of flight camera for NVIDIA Jetson AGX Orin and AGX Xavier

A quick introduction to NVIDIA and e-con Systems’ cameras

NVIDIA has been involved in developing camera sensors for various applications, focusing on AI-powered edge computing and autonomous vehicles. One of their most notable releases is the Jetson Nano Developer Kit (released in 2019). This System-on-Module (processor) is designed for AI-powered edge computing applications like object recognition and autonomous shopping.

As you may already know, e-con Systems has proven expertise in integrating our cameras into the NVIDIA platform. We support the entire NVIDIA Jetson family, including Jetson Xavier NX / Nano / TX2 NX, Jetson AGX Xavier, Jetson AGX Orin, and NVIDIA Jetson Orin NX / NANO. e-con Systems’ popular camera solutions come with advanced features, such as dedicated ISP, ultra-low-light performance, low noise, wide temperature range, LED flicker mitigation, bidirectional control, and long-distance transmission.

Benefits of using cameras powered by the NVIDIA platform

    • They work seamlessly with their powerful GPUs, which are optimized for processing large amounts of data in real time. This allows for advanced image processing and analysis, making it possible for machines to “see” and understand their surroundings with greater accuracy and speed.
    • They are capable of capturing high-quality data that can be used to train deep neural networks. So they can then be used for tasks such as object detection and recognition.
    • They are designed to be low-power and compact, making them ideal for use in embedded vision applications. This is particularly important for applications such as smart trolleys and smart checkout systems.
    • They are highly customizable, letting developers tailor them to specific applications and use cases. This flexibility makes it possible to create embedded vision solutions that are optimized for specific tasks and environments, providing better performance and reliability.

Read: Popular embedded vision use cases of NVIDIA® Jetson AGX Orin™

Major retail use cases of NVIDIA and e-con Systems

Smart Checkout

e-con Systems’ cameras, powered by the NVIDIA platform, are transforming smart checkout systems by enabling faster, more accurate, and more efficient checkout experiences for customers. Firstly, they can be used to enable contactless checkout, reducing the risk of transmission of infectious diseases. So, customers can avoid touching checkout equipment and interacting with cashiers, reducing the risk of transmission.

These smart checkout systems usually refer to a camera-enabled automated object detection system at the billing or checkout counter. They can operate autonomously with limited supervision from human staff – offering benefits like effective utilization of the retail staff, enhanced shopping experience, data insights on shopping patterns, and more. The integrated camera is equipped with smart algorithms to detect a wide variety of objects in a retail store.

Read: Key camera-related features of smart trolley and smart checkout systems

Smart Trolley

NVIDIA cameras are changing the game for retailers by providing real-time insights into customer behavior and preferences through the use of smart trolleys. These trolleys, equipped with cameras and sensors, help identify products or the barcode on each item, enabling customers to pay directly from the cart. This can greatly reduce wait times and improve overall customer satisfaction.

Moreover, the data collected by these cameras can enable retailers to offer personalized product recommendations and promotions based on past purchases and interactions. This personalized approach can increase sales and customer loyalty.

Another significant advantage of NVIDIA cameras in smart trolleys is enhanced store security. The cameras can detect and track suspicious activity in real time, such as items being removed from trolleys without payment or abandoned trolleys blocking store aisles.

Read: How embedded vision is contributing to the smart retail revolution

Other retail use cases include:

    • Optimized store operations and improved inventory management: With real-time data on store traffic and product placement, retailers can make informed decisions about store layout, staffing, and stock levels, leading to more efficient operations and reduced costs.
    • Personalized shopping experiences for customers: By analyzing customer behavior and preferences captured in image data, retailers can offer personalized product recommendations and promotions, which in turn leads to increased sales and customer satisfaction.

As the technology continues to evolve, it is likely that we will see even more innovative applications of NVIDIA-powered cameras in the retail industry.

NVIDIA and e-con Systems: An ongoing multi-year Elite partnership

NVIDIA and e-con Systems together have formed a one-stop ecosystem, providing USB, MIPI, GMSL, GigE, and FPD-Link camera solutions across several industries and significantly reducing time-to-market. This multi-year Elite partnership started with the Jetson Nano (roughly 0.5 TOPS) and continues strong with the Jetson AGX Orin (up to 275 TOPS).

Explore our NVIDIA Jetson-based cameras

If you are looking for an expert to help integrate NVIDIA cameras into your embedded vision products, please write to camerasolutions@e-consystems.com. You can also check out our Camera Selector page to get a full view of e-con Systems’ camera portfolio.

Ranjith Kumar
Camera Solution Architect, e-con Systems

The post How NVIDIA and e-con Systems are Helping Solve Major Challenges In the Retail Industry appeared first on Edge AI and Vision Alliance.

“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap https://www.edge-ai-vision.com/2023/10/practical-approaches-to-dnn-quantization-a-presentation-from-magic-leap/ Thu, 05 Oct 2023 08:00:05 +0000 https://www.edge-ai-vision.com/?p=44227 Dwith Chenna, Senior Embedded DSP Engineer for Computer Vision at Magic Leap, presents the “Practical Approaches to DNN Quantization” tutorial at the May 2023 Embedded Vision Summit. Convolutional neural networks, widely used in computer vision tasks, require substantial computation and memory resources, making it challenging to run these models on… “Practical Approaches to DNN Quantization,” …

“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap Read More +

The post “Practical Approaches to DNN Quantization,” a Presentation from Magic Leap appeared first on Edge AI and Vision Alliance.

Dwith Chenna, Senior Embedded DSP Engineer for Computer Vision at Magic Leap, presents the “Practical Approaches to DNN Quantization” tutorial at the May 2023 Embedded Vision Summit. Convolutional neural networks, widely used in computer vision tasks, require substantial computation and memory resources, making it challenging to run these models on…

“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap


The post “Practical Approaches to DNN Quantization,” a Presentation from Magic Leap appeared first on Edge AI and Vision Alliance.

FRAMOS Launches Event-based Vision Sensing (EVS) Development Kit https://www.edge-ai-vision.com/2023/10/framos-launches-event-based-vision-sensing-evs-development-kit/ Wed, 04 Oct 2023 14:30:44 +0000 https://www.edge-ai-vision.com/?p=44293 [Munich, Germany / Ottawa, Canada , 4 October] — FRAMOS launched the FSM-IMX636 Development Kit, an innovative platform allowing developers to explore the capabilities of Event-based Vision Sensing (EVS) technology and test potential benefits of using the technology on NVIDIA® Jetson with the FRAMOS sensor module ecosystem. Built around SONY and PROPHESEE’s cutting-edge EVS technology, …

FRAMOS Launches Event-based Vision Sensing (EVS) Development Kit Read More +

The post FRAMOS Launches Event-based Vision Sensing (EVS) Development Kit appeared first on Edge AI and Vision Alliance.

[Munich, Germany / Ottawa, Canada, 4 October] — FRAMOS launched the FSM-IMX636 Development Kit, an innovative platform allowing developers to explore the capabilities of Event-based Vision Sensing (EVS) technology and test potential benefits of using the technology on NVIDIA® Jetson with the FRAMOS sensor module ecosystem.

Built around SONY and PROPHESEE’s cutting-edge EVS technology, this developer kit simplifies the prototyping process and helps companies reduce time to market.

Event-based Vision Sensing (EVS)

Unlike conventional sensors that transmit all visible data in successive frames, the EVS sensor captures only the changed pixel data, specifically luminance changes. Each event package includes crucial information: pixel coordinates, timestamp, and polarity, resulting in efficient bandwidth usage.

By reducing the transmission of redundant data, this technology lowers energy consumption and optimizes processing capacities, reducing the cost of vision solutions.

EVS sensors provide high-speed and low-latency data output. They give outstanding results in monitoring vibration and movement in low-light conditions.
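
To give a feel for what this event format looks like in practice, the sketch below models a batch of events as (x, y, timestamp, polarity) records and accumulates them into a simple frame for visualization. The field names, resolution, and sample values are assumptions for illustration and do not represent the Metavision SDK’s actual API.

# Conceptual EVS sketch: each event carries pixel coordinates, a timestamp, and a
# polarity; summing polarities per pixel yields a simple event frame.
import numpy as np

HEIGHT, WIDTH = 720, 1280  # assumed sensor resolution

event_dtype = np.dtype([("x", np.uint16), ("y", np.uint16),
                        ("t", np.uint64), ("p", np.int8)])
events = np.array([(100, 200, 1000, 1), (101, 200, 1050, -1)], dtype=event_dtype)

def accumulate(batch, height=HEIGHT, width=WIDTH):
    # Sum event polarities per pixel to form an event frame.
    frame = np.zeros((height, width), dtype=np.int32)
    np.add.at(frame, (batch["y"], batch["x"]), batch["p"])
    return frame

event_frame = accumulate(events)
print("Pixels with activity:", np.count_nonzero(event_frame))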

The FSM-IMX636 Development Kit consists of an IMX636 Event-based Vision Sensor board with a lens, all necessary adapters, accessories, and drivers, crafted into a comprehensive, easy-to-integrate solution for testing EVS in embedded vision applications on the NVIDIA® Jetson AGX Xavier™ and NVIDIA® Jetson AGX Orin platforms.

The PROPHESEE Metavision® Intelligence Suite provides machine learning-supported event data processing, analytics, and visualization modules.

FRAMOS’ new Development Kit is an affordable, simple-to-use, and intelligent platform for testing, prototyping, and faster launch of diverse EVS-based applications in a wide range of fields, including industrial automation, medical, automotive and mobility, and IoT and monitoring.

For more information, visit this link.

About FRAMOS

FRAMOS® is the leading global expert in vision systems, dedicated to innovation and excellence in enabling devices to see and think.

For more than 40 years, the company has supported clients worldwide in building pioneering vision systems.

Throughout all phases of vision system development, from hardware and software solutions to component selection, customization, consulting, prototyping, and mass production, companies worldwide rely on FRAMOS’ proven expertise.

Thanks to its engineering excellence and a large base of loyal clients, the company operates successfully on three continents.

Over 180 experts working in the Munich, Ottawa, Zagreb, and Čakovec offices are committed to developing cutting-edge imaging solutions for a wide range of applications across diverse industries.

For more information, please visit www.framos.com or follow us on LinkedIn, Facebook, Instagram or Twitter.

 

The post FRAMOS Launches Event-based Vision Sensing (EVS) Development Kit appeared first on Edge AI and Vision Alliance.

“Optimizing Image Quality and Stereo Depth at the Edge,” a Presentation from John Deere https://www.edge-ai-vision.com/2023/10/optimizing-image-quality-and-stereo-depth-at-the-edge-a-presentation-from-john-deere/ Wed, 04 Oct 2023 08:00:49 +0000 https://www.edge-ai-vision.com/?p=44222 Travis Davis, Delivery Manager in the Automation and Autonomy Core, and Tarik Loukili, Technical Lead for Automation and Autonomy Applications, both of John Deere, present the “Optimizing Image Quality and Stereo Depth at the Edge” tutorial at the May 2023 Embedded Vision Summit. John Deere uses machine learning and computer vision (including stereo… “Optimizing Image Quality and Stereo …

“Optimizing Image Quality and Stereo Depth at the Edge,” a Presentation from John Deere Read More +

The post “Optimizing Image Quality and Stereo Depth at the Edge,” a Presentation from John Deere appeared first on Edge AI and Vision Alliance.

Travis Davis, Delivery Manager in the Automation and Autonomy Core, and Tarik Loukili, Technical Lead for Automation and Autonomy Applications, both of John Deere, present the “Optimizing Image Quality and Stereo Depth at the Edge” tutorial at the May 2023 Embedded Vision Summit. John Deere uses machine learning and computer vision (including stereo…

“Optimizing Image Quality and Stereo Depth at the Edge,” a Presentation from John Deere


The post “Optimizing Image Quality and Stereo Depth at the Edge,” a Presentation from John Deere appeared first on Edge AI and Vision Alliance.
