| Title: | LLM-Powered Fuzzy Join |
|---|---|
| Description: | Resolves ambiguous links between data.frames using large language models (LLMs). Supports matching across spelling variations, translations, and differing levels of precision. |
| Authors: | Yifan LIU [aut, cre] |
| Maintainer: | Yifan LIU <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.0 |
| Built: | 2026-06-17 09:23:34 UTC |
| Source: | https://github.com/evanliu3594/llmjoin |
Build a fuzzy-join joint data.frame via LLM
build_joint(x, y, key1, key2, ...)build_joint(x, y, key1, key2, ...)
x |
a data.frame to be joined on the lhs. |
y |
a data.frame to be joined on the rhs. |
key1 |
string, name of the key column of data.frame x waiting for pairing. |
key2 |
string, name of the key column of data.frame y waiting for pairing. |
... |
extra params passed to chat_llm() |
a 2-column data.frame mapping values from key1 to key2.
build_joint( x = data.frame(x = c("01","02","04")), y = data.frame(y = c("January","Feb","May")), key1 = "x", key2 = "y" )build_joint( x = data.frame(x = c("01","02","04")), y = data.frame(y = c("January","Feb","May")), key1 = "x", key2 = "y" )
This function sends a message to the LLM model and retrieves the result.
chat_llm( .message, .model = NULL, .temperature = 0, .max_tokens = 30000, .timeout = 300, .verbose = getOption("llmjoin.verbose", FALSE) )chat_llm( .message, .model = NULL, .temperature = 0, .max_tokens = 30000, .timeout = 300, .verbose = getOption("llmjoin.verbose", FALSE) )
.message |
the message to send. |
.model |
character, LLM model to use. By default NULL (uses config value). |
.temperature |
OpenAI style randomness control (0~1), by default 0. |
.max_tokens |
Max tokens to spend. |
.timeout |
Max seconds to communicate with LLM. |
.verbose |
logical, print progress messages. Default |
A character string with the LLM's response text.
chat_llm("tell a joke.")chat_llm("tell a joke.")
Generate a prompt to guide the LLM in generating a joint for data frame joining, leveraging the two key columns from the tables to be connected. As of 2025/04/10, DeepSeek R1 and gpt-4.1-mini showed the best result; other LLMs might fabricate non-existent data in the result.
joint_prompt(x, y)joint_prompt(x, y)
x |
1-column data.frame or vector of characters, left hand side of the join |
y |
1-column data.frame or vector of characters, right hand side of the join |
A character string containing the matching prompt.
joint_prompt( data.frame(x = c("01","02","04")), data.frame(y = c("January","Feb","May")) )joint_prompt( data.frame(x = c("01","02","04")), data.frame(y = c("January","Feb","May")) )
Fuzzy join with LLM
llm_join(x, y, key1, key2, ...)llm_join(x, y, key1, key2, ...)
x |
a data.frame to be joined on the lhs. |
y |
a data.frame to be joined on the rhs. |
key1 |
string, name of the key column of data.frame x waiting for pairing. |
key2 |
string, name of the key column of data.frame y waiting for pairing. |
... |
extra params passed to chat_llm() |
the fuzzy-joined data.frame
x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40)) y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400)) llm_join(x, y, key1 = "id", key2 = "month")x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40)) y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400)) llm_join(x, y, key1 = "id", key2 = "month")
Strips markdown fences, extracts the longest consecutive block of comma-separated lines, ensures a header row matching 'key1,key2' is present, and parses the CSV into a 2-column data.frame.
parse_joint(llm_response, key1, key2)parse_joint(llm_response, key1, key2)
llm_response |
character, raw response from the LLM. |
key1 |
string, name of the lhs key column. |
key2 |
string, name of the rhs key column. |
a 2-column data.frame mapping values from key1 to key2.
parse_joint("01,January\n02,Feb\n04,May", key1 = "id", key2 = "month")parse_joint("01,January\n02,Feb\n04,May", key1 = "id", key2 = "month")
Set up your LLM service with native support for OpenAI, Claude (Anthropic), and Gemini (via OpenAI-compatible endpoint). For custom endpoints like Ollama, proxies, DeepSeek, Kimi, and others, use provider = "openai" along with your custom URL to connect through the compactible API interface. All information is stored strictly locally in your system configuration and is never uploaded or shared.
set_llm(provider = "openai", url = NULL, key = NULL, model = NULL)set_llm(provider = "openai", url = NULL, key = NULL, model = NULL)
provider |
character, LLM provider. One of "openai", "claude", "gemini". Default "openai". |
url |
url to your LLM provider endpoint. If NULL, auto-set based on provider. |
key |
api-key of your service. |
model |
character, model name. If NULL, auto-set from provider default. |
NULL invisibly. Called for side effect of writing the config file.
set_llm(provider = "openai", key = "<your-openai-api-key>", model = "gpt-5.4-mini")set_llm(provider = "openai", key = "<your-openai-api-key>", model = "gpt-5.4-mini")
Convert a data frame to a markdown table
tbl2md(tbl, nm = NULL)tbl2md(tbl, nm = NULL)
tbl |
a data.frame object or a vector. |
nm |
character, only used if 'tbl' is a vector. |
markdown style table string lines
tbl2md(iris)tbl2md(iris)